Abstract

Models phrased through moment conditions are central to much of modern inference. Here
these moment conditions are embedded within a nonparametric Bayesian setup. Handling such
a model is not probabilistically straightforward as the posterior has support on a manifold.
We solve the relevant issues, building new probability and computational tools using Hausdorff
measures to analyze them on real and simulated data. These new methods, which involve
simulating on a manifold, can be applied widely, including providing Bayesian analysis of
quasi-likelihoods, linear and nonlinear regression, missing data and hierarchical models.

Keywords: Decision theory; Empirical likelihood; Hausdorff measure; Markov chain Monte Carlo;
Method of moments; Nonparametric Bayes; Simulation on manifolds.


1 Introduction 

1.1 Overview 

Much of modern inference is phrased in terms of moment conditions and analyzed using asymptotic 
approximations. Here we build a new methodology which dovetails with decision theory. Moment 
conditions are embedded within a nonparametric Bayesian setup, allowing an individual to mix 
moment conditions with data and scientifically informative priors to make rational decisions without 
the recourse to the veil of parametric assumptions or asymptotics. 

Embedding moments within nonparametrics is not probabilistically straightforward. This paper
spells out the issues, develops the corresponding probability theory to solve them and devises novel
strategies for simulating on a manifold to implement them in practice on simulated and real data.
It covers the case where it is hard, or indeed impossible, to solve the moment equations. This
allows the rational analysis of moment condition models with many solutions.

*We thank Isaiah Andrews, Yang Chen, Herman van Dijk, Mikkel Plagborg-Moller and Christian Robert for their
comments on an earlier draft.


The scope of the new methods is vast. They cover, for example, linear, nonlinear and
instrumental variable regression. By thinking of the moment condition as the score of a parametric
statistical model, our analysis also provides a Bayesian treatment of quasi-likelihood methods, which
are widely applied in statistics (e.g. Cox (1961), White (1994)). Finally, this framework provides a
solid basis to deal systematically with missing data (e.g. Little and Rubin (2002)), shrink
parameters (e.g. Efron (2012)) and build hierarchical models (e.g. Gelman et al. (2003)).

1.2 The conceptual challenge 


To discuss the paper's contribution and place it in the context of the literature, it is helpful
to establish some notation; a formal statement will appear in Section 2.

Assume one has independent and identically distributed (i.i.d.) $d$-dimensional data $Z_i$, $i = 1,2,...,n$,
taking values in the known support $s_1, s_2, ..., s_J$ and having distribution function $F$. We then
write $\Pr(Z_i = s_j \mid \theta, \beta) = \theta_j$, where the $p$-dimensional $\beta$ satisfies the $r$-dimensional moment condition

$$E_Z\{g(Z,\beta)\} = \int g(z,\beta)\,F(dz) = \sum_{j=1}^{J} \theta_j\, g(s_j,\beta) = 0. \tag{1}$$

Here $\beta$ is the parameter of scientific interest. We then view $\theta = (\theta_1, \theta_2, ..., \theta_{J-1})'$ (with
$\theta_J = 1 - \iota'\theta$, where $\iota$ is a vector of ones of appropriate size) as nuisance parameters to be treated
nonparametrically. The task is to learn $p(\beta,\theta \mid Z)$ or $p(\beta \mid Z)$, where $Z = (Z_1, Z_2, ..., Z_n)'$. A simple
example of this is $g(s_j,\beta) = s_j - \beta$, which delivers the mean.

Although this problem is easy to state, it is not easily carried through, as traditional
nonparametric models clash with the moment conditions, in effect overspecifying the model. Expressing this
in a different way: the prior and posterior for $\beta, \theta$ are typically supported on a zero Lebesgue
measure $(J + p - 1 - r)$-dimensional set, $\Theta_{\beta,\theta}$, in $\mathbb{R}^{J+p-1}$. As a result, traditional Markov chain Monte
Carlo (MCMC) methods (or alternatives like importance sampling) for sampling from $p(\beta,\theta \mid Z)$
entirely collapse. This paper solves this problem in two different ways: the comparative advantages
of each will depend upon the form of the moment conditions. Taken overall this paper provides a
unified solution to this central problem.


1.3 Literature on classical analysis of moments 

Before we detail our new approach, we will discuss how this work relates to the literature.
Moment based estimation was introduced by Pearson (1894). A relatively modern version of
this procedure first estimates $\theta$ nonparametrically, that is $F$ by the empirical distribution function
$F_n$, and then plugs it into (1), yielding the function

$$\int g(z,\beta)\,F_n(dz) = \sum_{j=1}^{J} \hat{\theta}_j\, g(s_j,\beta).$$

In the $p = r$ case we move $\beta$ around until this function equals a vector of zeros, delivering the method
of moments estimator $\hat{\beta}$. Extensions include, for example, Sargan (1958, 1959), Durbin (1960),
Godambe (1960), Wedderburn (1974), McCullagh and Nelder (1989), Hansen (1982), Chamberlain
(1987), Hansen et al. (1996), Gallant and Tauchen (1996) and Gourieroux et al. (1993). Hall (2005)
gives a recent review.
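
The plug-in step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not an implementation from the literature cited: it uses the mean example $g(s_j,\beta) = s_j - \beta$ from Section 1.2, takes empirical frequencies as the weights, and solves the scalar moment equation by bisection. The function names, the toy support and counts, and the choice of bisection are all hypothetical.

```python
def plug_in_moment(beta, support, weights, g):
    """Plug-in moment function: sum_j weights[j] * g(s_j, beta)."""
    return sum(w * g(s, beta) for s, w in zip(support, weights))


def solve_by_bisection(f, lo, hi, tol=1e-12):
    """Root of a scalar function f on [lo, hi]; assumes f changes sign on the interval."""
    f_lo = f(lo)
    if f_lo * f(hi) > 0:
        raise ValueError("f(lo) and f(hi) must have opposite signs")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_lo * f_mid <= 0:
            hi = mid               # root lies in [lo, mid]
        else:
            lo, f_lo = mid, f_mid  # root lies in [mid, hi]
    return 0.5 * (lo + hi)


def g_mean(s, beta):
    """Moment function for the mean: g(s, beta) = s - beta."""
    return s - beta


# Toy data (assumed for illustration): support points and how often each was observed.
support = [1.0, 2.0, 4.0]
counts = [5, 3, 2]
theta_hat = [c / sum(counts) for c in counts]  # empirical frequencies

beta_hat = solve_by_bisection(
    lambda b: plug_in_moment(b, support, theta_hat, g_mean),
    min(support), max(support),
)
```

For the mean the root is of course available in closed form as the weighted average; the root-finder is shown only because other moment functions `g` generally have no closed-form solution.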

An elegant implementation of moment based inference is through empirical likelihood.
Motivated by Owen (1988, 1990), Qin and Lawless (1994) and Imbens et al. (1998) discussed empirical
likelihood based inference in overidentified moment condition models. See also the reviews by Owen
(2001), Kitamura (2007) and Lancaster and Jun (2010).


1.4 Literature on Bayesian analysis of moments 


Our work is fully Bayesian. Much of our work has been inspired by Chamberlain (1987) and in
particular Chamberlain and Imbens (2003). Chamberlain and Imbens (2003) place a Dirichlet prior
on $\theta$, which implies the posterior on $\theta$ is Dirichlet. These priors and posteriors are straightforward
to sample from, as noticed by Rubin (1981) in his Bayesian bootstrap. Chamberlain and Imbens
(2003) suggest that for each posterior draw of $\theta$ they would solve the moment conditions to imply
a value (or in principle a set of values) of $\beta$. Collecting a sample of such solved values provides a
sample from a posterior on $\beta$. Unfortunately these authors have no control over the prior for $\beta$,
the parameter of scientific interest.
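
The draw-then-solve scheme just described can be sketched as follows. This is an illustrative sketch, not the cited authors' code: it assumes a symmetric Dirichlet prior with a hypothetical concentration argument `prior_alpha`, uses the mean moment condition $g(s_j,\beta) = s_j - \beta$ so that the solve step is a weighted average, and generates Dirichlet draws by normalizing independent gamma variates.

```python
import random


def dirichlet_draw(alphas, rng):
    """One Dirichlet(alphas) draw, obtained by normalizing independent Gamma(alpha_j, 1) variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [x / total for x in gammas]


def bayesian_bootstrap_mean(support, counts, prior_alpha, n_draws, seed=0):
    """Draw theta from its Dirichlet posterior, then solve the mean moment condition for beta.

    Returns a list of (beta, theta) pairs; theta has J entries that sum to one.
    """
    rng = random.Random(seed)
    posterior_alphas = [c + prior_alpha for c in counts]
    draws = []
    for _ in range(n_draws):
        theta = dirichlet_draw(posterior_alphas, rng)
        # For g(s, beta) = s - beta, sum_j theta_j (s_j - beta) = 0 gives beta = sum_j theta_j s_j.
        beta = sum(t * s for t, s in zip(theta, support))
        draws.append((beta, theta))
    return draws
```

The implied distribution of `beta` is whatever the Dirichlet draws induce through the solve step, which is exactly the lack of control over the prior for $\beta$ noted above.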


Also important is Kitamura and Otsu (2011), who have two methods, both expressed in terms
of Dirichlet process priors. Here we convert them into our finite framework. In their exponentially
tilted case they first specify a prior $p(\beta)p(\theta)$ before finding $\theta^* = (\theta_1^*, \theta_2^*, ..., \theta_J^*)$ which minimizes
$\sum_{j=1}^{J} \theta_j^* \log(\theta_j^*/\theta_j)$ subject to the moment constraints $\sum_{j=1}^{J} \theta_j^*\, g(s_j,\beta) = 0$ and the probability
axioms. They then set $\Pr(Z_i = s_j \mid \theta, \beta) = \theta_j^*$, using this model to learn $\beta$ and $\theta$ from the data. Shin
(2014) carefully investigates various computational aspects of this approach. This approach has
many advantages but it leaves pairs of $\beta$ and $\theta$ with positive posterior probability which are not
logically compatible. Kitamura and Otsu (2011) also propose a synthetic Dirichlet process (with
connections to Doss (1985) and Newton et al. (1996)).

There are also many papers which provide alternative methods, including a substantial literature


on the Bayesian use of moments through approximate methods. Chernozhukov and Hong (2003) 


specify a quadratic form in the moment conditions and use this as the basis of a log quasi-likelihood 
function. They then use this approximate likelihood to carry out Bayesian inference using MCMC 


alongside a sandwich estimator. Related work includes Yin (2009). Muller (2013) provides a 


Bayesian version of the asymptotic sandwich matrix commonly seen in quasi-likelihood inference 
and links it to decision theory. 


Lazar (2003), Schennach (2005) and Yang and He (2012) provide Bayesian interpretations to 


empirical likelihood and study the resulting properties. Mengersen et al. (2013) look at moment 


conditions and empirical likelihood using approximate Bayesian computation. See also Zellner 


(1997) and Zellner et al. (1997), who suggested a Bayesian moment method by building a likelihood 


defined through the maximum entropy density consistent with the moment conditions. Related is 


the Bayesian work on factor and cointegration models, e.g. Strachan and van Dijk (2004). 


In a series of papers Gallant and Hong (2007), Gallant et al. (2014) and Gallant (2015) develop 


methods which devise a prior using fiducial arguments from moment conditions. Related work


includes Jaynes (2003) and Kwan (1998). Florens and Simoni (2015) have used Gaussian processes 


in combination with moment constraints to carry out Bayesian inference. 

1.5 Computational issues 

Here the prior and posterior for $\beta, \theta$ are supported on a zero Lebesgue measure
$(J + p - 1 - r)$-dimensional set, $\Theta_{\beta,\theta}$, in $\mathbb{R}^{J+p-1}$. Hence Bayesian inference will need us to sample from a
distribution defined on a zero measure set, rendering standard Monte Carlo methods useless.


In an influential paper Gelfand et al. (1992) use MCMC methods to deal with constrained
parameter spaces, but in their paper the constraints do not change the dimension of the support.
Hurn et al. (1999) carry out MCMC in constrained parameter spaces (sampling from a distribution
$\pi(x)$ subject to a constraint $C(x) = 0$) using block updating. Golchi and Campbell (2014) carry


out sampling subject to constraints using sequential Monte Carlo methods by slowly introducing 


the constraints. However, they do not explore the change of measure issue we discuss here. Chiu 


(2008) use a singular normal distribution in posterior updating for an under-identified hierarchical 


model. Related work includes Sun et al. (1999). Overspecified factor models also have some of these 


features, as discussed by West (2003). Fiorentini et al. (2004) face related but highly specialized 


challenges when sampling missing data in a GARCH model. 

There are a few recent papers on MCMC simulation from distributions defined on manifolds.
Brubaker et al. (2012) propose a Hamiltonian Monte Carlo algorithm on implicitly defined manifolds. Numerical
integration of the Hamiltonian dynamics requires solving a system of $3d$ nonlinear equations for each
update, where $d$ is the dimension of the space in which the manifold is embedded (in our setting
$d = J + p - 1$). Byrne and Girolami (2013) introduce a Hamiltonian Monte Carlo simulation
algorithm for sampling from manifolds with known geodesic structure. They demonstrate how
this algorithm can be used to sample from distributions defined on hyperspheres and
Stiefel manifolds of orthonormal matrices. Diaconis et al. (2013) provide a short review of concepts
in geometric measure theory. They discuss algorithms for sampling from distributions defined on
Riemannian manifolds that are similar to the "marginal method" that will be introduced shortly.
It is this paper which has been the most helpful to us in terms of Monte Carlo methods.


1.6 Outline of the paper 

In the next section of the paper we will introduce the formal model under study, and discuss how one
specifies meaningful prior distributions on the parameters of interest. In Section 3 several methods
for inference and their relative merits and pitfalls are discussed. A later section discusses mechanisms
for generating priors for these models. We also draw out how to make inference when the support
of the data is unknown, regarding the unseen support as missing. This is followed by a section
in which some illustrative examples are demonstrated. We then explore several empirical studies
before concluding. An Appendix collects the proofs of the propositions stated in the paper
and a collection of additional results.


2 Bayesian moment conditions models 

2.1 The model 


Assume the data we have available to make inference is $Z = (Z_1, ..., Z_n)$, where the $Z_i$ are
$d$-dimensional i.i.d. draws from an unknown distribution which has $J$ points of known support
$\{s_1, s_2, ..., s_J\} = S$ (we relax this known support condition in Section 3.6). Throughout we write

$$\Pr(Z_i = s_j \mid \theta, \beta) = \theta_j, \quad j = 1,2,...,J, \tag{2}$$

with $\theta = (\theta_1, \theta_2, ..., \theta_{J-1})' \in \Theta_\theta \subseteq \Delta_{J-1}$, where $\Delta_{J-1} = \{\theta = (\theta_1, \theta_2, ..., \theta_{J-1})' : \iota'\theta < 1 \text{ and } \theta_j > 0 \text{ for all } j\}$
and $\theta_J = 1 - \iota'\theta$, in which $\iota$ is a vector of ones. Further, the science of the problem is
characterized by the values of $\beta$ which solve the $r$ unconditional moment conditions,

$$\sum_{j=1}^{J} \theta_j\, g(s_j,\beta) = 0, \tag{3}$$

where $\beta \in \Theta_\beta \subseteq \mathbb{R}^p$ and $g : \mathbb{R}^d \times \mathbb{R}^p \to \mathbb{R}^r$. Typically the scientific conclusions will center around
inferences on $\beta$, although predictive type inference may also additionally feature $\theta$. This paper


concentrates on the case of exactly identified models (r = p). Appendix A.7 extends to the more 
general case of over and under identification at the cost of more clutter but without having to 
generate any new ideas. 


2.2 Parameter space and prior 


Throughout this paper we will think of $\beta$ and $\theta$ as parameters to be learned from the data, $Z$. We
write the $J + p - 1$ parameters

$$(\beta', \theta')' \in \Theta_{\beta,\theta},$$

where $\Theta_{\beta,\theta} \subset \Theta_\beta \times \Theta_\theta$ is the joint support for $\beta$ and $\theta$. Each point within $\Theta_{\beta,\theta}$
is a pair $(\beta,\theta)$ which satisfies both the moment conditions and probability axioms. The moment
conditions are:

$$H_\beta \theta + g_J = 0, \quad \text{where} \quad H_\beta = (g_1, ..., g_{J-1}) - g_J \iota',$$

in which $g_j = g(s_j,\beta)$ (for $1 \le j \le J$). Moreover $H_\beta$ is assumed to be of full row rank (we
will often suppress the dependence on $\beta$ and just write $H$). These constraints, together with the
inequalities $\theta_j > 0$ (for $j = 1, 2, ..., J$), implicitly define the $(J-1)$-dimensional set of parameters
within $\mathbb{R}^{J+p-1}$ which will be denoted by $\Theta_{\beta,\theta}$. Hence the parameter space, $\Theta_{\beta,\theta}$, depends upon
the support of the data, $S = \{s_1, ..., s_J\}$, but is not data dependent. Throughout the paper, the
notation $\Theta_\lambda$ will generically represent the parameter space of $\lambda$, in which $\lambda$ is a set of parameters.

The set of admissible pairs $(\beta,\theta)$, denoted by $\Theta_{\beta,\theta}$, is a zero measure set (with respect to
Lebesgue measure) in $\mathbb{R}^{J+p-1}$. We will assume that researchers can place a prior density, $p(\beta,\theta)$,
with respect to the $(J-1)$-dimensional Hausdorff measure on $\Theta_{\beta,\theta}$. Using the Hausdorff measure¹
as the base measure, we are able to assign measures to the lower dimensional subsets of $\mathbb{R}^{J+p-1}$,
and therefore we can define probability density functions with respect to Hausdorff measure on
manifolds (and more complex zero Lebesgue measure sets) in a Euclidean space.

¹Assume $E \subseteq \mathbb{R}^n$, $d \in [0, +\infty)$ and $\delta \in (0, +\infty]$. The Hausdorff premeasure of $E$ is defined as follows,

$$\mathcal{H}^d_\delta(E) = v_d \inf_{E \subset \cup_j E_j,\ \mathrm{diam}(E_j) < \delta} \sum_j \mathrm{diam}(E_j)^d, \quad \text{where} \quad v_d = \frac{\Gamma(1/2)^d}{2^d\, \Gamma(d/2 + 1)}$$

is the volume of the $d$-dimensional ball of unit diameter, and $\mathrm{diam}(E_j)$ is the diameter of $E_j$. $\mathcal{H}^d_\delta(E)$ is a nonincreasing
function of $\delta$, and the $d$-dimensional Hausdorff measure of $E$ is defined as its limit when $\delta$ goes to zero, $\mathcal{H}^d(E) = \lim_{\delta \to 0^+} \mathcal{H}^d_\delta(E)$.
The Hausdorff measure is an outer measure. Moreover $\mathcal{H}^n$ defined on $\mathbb{R}^n$ coincides with Lebesgue
measure. See Federer (1969) for more details.

Figure 1: In the plot on the left the blue curve, $\beta = \log\{\theta/(1-\theta)\}$, is the parameter space of the logit
model, $\Theta_{\beta,\theta}$. In the plot on the right the density of the prior $p(\beta,\theta)$ (with respect to Hausdorff
measure) is depicted. This density lives on the blue curve, which is its support $\Theta_{\beta,\theta}$.


2.3 Some examples 

To cement this we have built a starkly simple example which captures most of the challenges in 
this problem. It faces off a nonparametric model against a scientific parameter of interest. 


Example 1 (Logistic) Assume $Z_i \mid \theta \sim \mathrm{Bernoulli}(\theta)$, and let $\beta = \log\{\theta/(1-\theta)\} = \mathrm{logit}(\theta)$ be the
scientific parameter of interest. Jointly $\beta, \theta$ captures the inherent singularity implicit in all moment
based inference. The moment condition is

$$g(s,\beta) = s - \frac{e^\beta}{1 + e^\beta}.$$

Therefore the parameter space, $\Theta_{\beta,\theta}$, is

$$\Theta_{\beta,\theta} = \left\{(\beta,\theta) \in \mathbb{R} \times [0,1] : \beta = \log\frac{\theta}{1-\theta}\right\}.$$

This is shown as the blue curve sitting at ground level in the left panel of Figure 1. Of fundamental
importance is that if $\theta$ moves by $d\theta$ then the length of the journey along this curve will be (by
Pythagoras's theorem)

$$d\theta\,\sqrt{1 + J_\theta^2}, \quad J_\theta = \frac{\partial\beta}{\partial\theta}.$$

The right panel of Figure 1 repeats the support but now above it is a density $p(\beta,\theta)$ (the form of the density is not
expositionally important at this point) with respect to this curve, or more formally
the one-dimensional Hausdorff measure on $\Theta_{\beta,\theta}$. Then for any set $C \subset \Theta_{\beta,\theta}$

$$\Pr\{(\beta,\theta) \in C\} = \int_{C_\theta} p(\beta,\theta)\,\sqrt{1 + J_\theta^2}\; d\theta,$$

where $C_\theta$ is the projection of $C$ on $\theta$'s axis (i.e. we integrate over all values of $\theta$ which imply a $\beta$
such that the pair $(\beta,\theta) \in C$). This means as we integrate over $\theta$, we must multiply the density on
the curve by the local length of the curve.
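
The length element in this example can be checked numerically. The sketch below is an illustration only (function names, grid size and the interval $[0.2, 0.8]$ are assumptions): it compares the integral of $\sqrt{1 + J_\theta^2}$, using $J_\theta = \partial\,\mathrm{logit}(\theta)/\partial\theta = 1/\{\theta(1-\theta)\}$, with the length of a fine polyline drawn along the curve $(\theta, \mathrm{logit}(\theta))$.

```python
import math


def logit(theta):
    return math.log(theta / (1.0 - theta))


def jacobian(theta):
    """J_theta = d beta / d theta for beta = logit(theta)."""
    return 1.0 / (theta * (1.0 - theta))


def length_by_integral(a, b, n=20000):
    """Midpoint rule for the integral of sqrt(1 + J_theta^2) over [a, b]."""
    h = (b - a) / n
    return h * sum(math.sqrt(1.0 + jacobian(a + (i + 0.5) * h) ** 2) for i in range(n))


def length_by_polyline(a, b, n=20000):
    """Sum of straight segment lengths along the curve (theta, logit(theta))."""
    h = (b - a) / n
    total = 0.0
    for i in range(n):
        t0, t1 = a + i * h, a + (i + 1) * h
        total += math.hypot(t1 - t0, logit(t1) - logit(t0))
    return total
```

The two functions should agree closely on any interval strictly inside $(0,1)$, which is the Pythagorean argument in the text applied segment by segment.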


We will study how to transform this prior $p(\beta,\theta)$ into a posterior and simulate from it. This
will allow us to learn $\beta$ from the data. As with all Bayesian calculations, it is not trivial to establish
a widely acceptable prior $p(\beta,\theta)$. We will return to that very practical issue in a later section.

Before we leave this section we give a less artful example. 


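Example 2 (Mean) Let $Z$ be a scalar random variable and $g(s,\beta) = s - \beta$, so $\beta$ is a mean. Then

$$\Theta_{\beta,\theta} = \left\{(\beta,\theta) : \sum_{j=1}^{J} \theta_j s_j = \beta,\ \theta_j > 0 \text{ for all } j, \text{ and } \iota'\theta < 1\right\}.$$

Thus $\Theta_{\beta,\theta}$ is a region within a $(J-1)$-dimensional hyperplane in $\mathbb{R}^J$. However, not all elements of this
hyperplane are admissible, since $\theta$ should satisfy the probability axioms (elements of $\theta$ should be positive
and $1 - \iota'\theta > 0$). Therefore the parameter space $\Theta_{\beta,\theta}$ is a convex subset of the hyperplane. Then
if $\theta$ moves by $d\theta_1, ..., d\theta_{J-1}$ the area of the corresponding parallelogram on the hyperplane is

$$d\theta_1 \cdots d\theta_{J-1}\,\sqrt{1 + J_\theta J_\theta'}, \quad J_\theta = \frac{\partial\beta}{\partial\theta'},$$

where $\partial\beta/\partial\theta_j = s_j - s_J$, $j = 1, 2, ..., J-1$. So for any measurable set $C \subset \Theta_{\beta,\theta}$,

$$\Pr\{(\beta,\theta) \in C\} = \int_{C_\theta} p(\beta,\theta)\,\sqrt{1 + \sum_{j=1}^{J-1} (s_j - s_J)^2}\; d\theta \propto \int_{C_\theta} p(\beta,\theta)\, d\theta,$$

where $C_\theta$ is the projection of $C$ on $\theta$ (the last proportionality is due to the fact that the Jacobian
only depends on the support of the data). Thus the linearity of the moment condition (that results
in a flat parameter space $\Theta_{\beta,\theta}$) translates into a somewhat trivial multiplicative correction factor
and so yields a simple relationship between $\Pr\{(\beta,\theta) \in C\}$ and $p(\beta,\theta)$.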


Example 3 (Regression) The previous example can be generalized to the family of regression
models. For instance consider a linear regression model, $E(s^{(1)} \mid s^{(2)}) = \beta' s^{(2)}$, where $s = (s^{(1)}, s^{(2)\prime})'$,
in which $s^{(1)}$ is a scalar and $s^{(2)}$ is a $d$-dimensional vector, and $\beta$ is a $p$-dimensional vector of
parameters. The linear regression parameters solve the following moment condition equation,

$$E[g(s,\beta)] = E\left[s^{(2)}\left(s^{(1)} - \beta' s^{(2)}\right)\right] = 0.$$

We can also discuss the estimation of a linear regression model with instrumental variables. Assume
$s = (s^{(1)}, s^{(2)\prime}, s^{(3)\prime})'$, where $s^{(1)}$ is a scalar, and $s^{(2)}$ and $s^{(3)}$ are $p$-dimensional vectors (independent
and instrumental variables, respectively). If we define $g(s,\beta) = s^{(3)}(s^{(1)} - \beta' s^{(2)})$, then $\beta$ is
the solution to $E[g(s,\beta)] = 0$. Moreover generalizing to the nonlinear regression model is easy.
Assume $E(s^{(1)} \mid s^{(2)}) = \mu(s^{(2)},\beta)$. Then the corresponding moment condition equation is
$g(s,\beta) = s^{(2)}\{s^{(1)} - \mu(s^{(2)},\beta)\}$. For instance for a Poisson regression $g(s,\beta) = s^{(2)}\{s^{(1)} - \exp(\beta' s^{(2)})\}$.


Example 4 (Average treatment effect) Consider a causal inference problem with the
observational data $Z_j = (X_j, Y_j, W_j)$ (for $1 \le j \le N$), where $X_j$ is the $K$-dimensional vector of the
$j$-th unit's background variables, $Y_j$ is its scalar outcome variable, and $W_j$ is the binary treatment
indicator. Assuming super-population unconfoundedness, it can be shown that (Imbens and Rubin (2015))

$$E_{sp}[Y_j(1)] = E\left[\frac{W_j Y_j}{e(X_j)}\right] \quad \text{and} \quad E_{sp}[Y_j(0)] = E\left[\frac{(1 - W_j) Y_j}{1 - e(X_j)}\right],$$

where $e(X_j)$ is the propensity score, $e(X_j) = \eta_j = \Pr(W_j = 1 \mid X_j)$. Therefore the average treatment effect (ATE) is

$$\tau = E_{sp}[Y_j(1)] - E_{sp}[Y_j(0)] = E\left[\frac{W_j Y_j}{e(X_j)} - \frac{(1 - W_j) Y_j}{1 - e(X_j)}\right].$$

One might use a logistic regression model for the propensity score, $\eta_j = \exp(\gamma' X_j)/\{1 + \exp(\gamma' X_j)\}$,
where $\gamma$ is $K$-dimensional. Under these assumptions the model's parameters, $\beta = (\gamma, \tau)$, solve the
following set of moment conditions,

$$E[g(Z_j,\beta)] = E\begin{bmatrix} X_j (W_j - \eta_j) \\ \dfrac{(W_j - \eta_j) Y_j}{\eta_j (1 - \eta_j)} - \tau \end{bmatrix} = 0.$$

If we assume the data points are i.i.d. realizations from a discrete distribution with finite and known
support $S = \{s_1, ..., s_J\}$, $\Pr(Z_j = s_j) = \theta_j$, the moment conditions are,

$$E[g(Z_j,\beta)] = \sum_{j=1}^{J} \theta_j\, g(s_j,\beta) = 0,$$

with $g$ as in the previous display, evaluated at each support point $s_j$.
Thus the propensity scores and the ATE can be estimated jointly (e.g. McCandless et al. (2009),
Zigler et al. (2013) and Zigler and Dominici (2014)).

3 Inference

3.1 Likelihood and posterior 


Under the assumptions formulated above, the model's likelihood is

$$p(Z \mid \beta,\theta) = \prod_{j=1}^{J} \theta_j^{n_j},$$

where $n_j = \sum_{i=1}^{n} 1(Z_i = s_j)$. Note that although $\beta$ does not appear in the likelihood explicitly,
due to the constraints on $\beta$ and $\theta$, the data is informative about $\beta$.

The posterior is supported on the same set as the prior, and may be written as

$$p(\beta,\theta \mid Z) \propto p(\beta,\theta) \prod_{j=1}^{J} \theta_j^{n_j}. \tag{4}$$

The terms in (4) are easy to compute for any $(\beta,\theta)$ in $\Theta_{\beta,\theta}$ but the support is defined implicitly.
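
As a concrete reading of the posterior kernel, the sketch below evaluates its logarithm for a given point. It is illustrative only: the function name and argument layout are hypothetical, the caller is assumed to supply the log prior density already evaluated at $(\beta,\theta)$, and the caller is also responsible for ensuring that $(\beta,\theta)$ satisfies the moment conditions, since the code checks only the probability axioms.

```python
import math


def log_posterior_kernel(theta_free, counts, log_prior):
    """Log of p(beta, theta) * prod_j theta_j^{n_j}, up to an additive constant.

    theta_free holds theta_1..theta_{J-1}; theta_J = 1 - sum(theta_free).
    counts holds n_1..n_J. log_prior is log p(beta, theta) at the current point.
    Returns -inf when theta violates the probability axioms.
    """
    theta = list(theta_free) + [1.0 - sum(theta_free)]
    if any(t <= 0.0 for t in theta):
        return -math.inf
    log_lik = sum(n * math.log(t) for n, t in zip(counts, theta))
    return log_prior + log_lik
```

Working on the log scale avoids underflow when the counts $n_j$ are large.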


3.2 Accessing the posterior 


Inference can be carried out by sampling from the posterior distribution of the parameters. However, 
in this problem, traditional simulation algorithms will fail because the prior and the posterior of the 
model are supported on a zero Lebesgue measure set (e.g. all the proposed moves of a Metropolis- 
Hastings (MH) algorithm with a traditional proposal will be rejected almost surely). 

Here two solutions to this problem are given. In the first approach, called the "marginal
method", we will derive the density function of the marginal of $\theta$, which has a density with respect
to the Lebesgue measure $p(\theta)$ and therefore can be processed by conventional Monte Carlo methods.
Examples include standard MCMC algorithms and importance sampling. This is simple but comes
at the cost of having to solve for $\beta$ for each proposal. If finding $\beta$ (or indeed all the values of $\beta$
which solve given $\theta$) is cheap then this provides a very solid solution to the problem.

In the second approach, called the "joint method", we define a proposal in the space of $(\beta,\theta)$
that assigns positive probability to $\Theta_{\beta,\theta}$ (so, with positive probability, the proposed moves remain
on the manifold $\Theta_{\beta,\theta}$ and will be accepted). An MH algorithm with this proposal is able to
efficiently move in the space. This does not require us to solve the moment conditions at all, which
is extremely attractive for difficult to solve moment condition models.


3.3 Marginal method 

Let $p(\beta,\theta)$ be the density function of the model's prior or posterior with respect to Hausdorff
measure on $\Theta_{\beta,\theta}$. Proposition 1 gives the marginal density of $\theta$ with respect to Lebesgue measure.
This implies that standard Monte Carlo methods (e.g. MCMC, importance sampling, sequential
importance sampling and Hamiltonian Monte Carlo) can be used.²


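Proposition 1 Let $p(\beta,\theta)$ be the density function of the prior or posterior with respect to Hausdorff
measure supported on $\Theta_{\beta,\theta}$. Moreover, assume $p = r$ (the "just identified" case) and $\beta$ is uniquely
determined by $\theta$, i.e. $\beta = \beta(\theta)$. Then the density function of $\theta$ with respect to Lebesgue measure is

$$p(\theta) = |J_\theta J_\theta' + I_p|^{1/2}\, p(\beta,\theta), \tag{5}$$

where

$$J_\theta = \frac{\partial\beta}{\partial\theta'} = -\left(\frac{\partial g}{\partial\beta'}\right)^{-1} H_\beta, \quad H_\beta = (g_1, ..., g_{J-1}) - g_J \iota', \tag{6}$$

with $\iota$ being a $(J-1)$-vector of ones and

$$\frac{\partial g}{\partial\beta'} = \sum_{j=1}^{J} \theta_j \frac{\partial g(s_j,\beta)}{\partial\beta'}.$$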

This proposition is a direct result of the "area formula" of Federer (1969) (see also Diaconis
et al. (2013)) and it can be generalized straightforwardly to the cases where for some values of $\theta$
there exists more than one $\beta$ by summing over the right hand side for each solution in $\beta$.

The Jacobian³ term $|J_\theta J_\theta' + I_p|^{1/2}$ depends on the geometry of the parameter space $\Theta_{\beta,\theta}$ (in
other words, it only depends on the moment conditions) and is independent of $p(\beta,\theta)$. To compute
this term we need to invert a $p \times p$ matrix and evaluate the determinant of a $p \times p$ matrix. However,
$p$ is usually small, in which case the computational cost of these operations is negligible.

Importantly knowledge of the functional form of $\beta$ as a function of $\theta$ is not needed, since the
partial derivatives can be obtained by the implicit function theorem. However, in order to evaluate
this density function for a given $\theta$, we need its corresponding $\beta$. Although in some problems $\beta$ has
a known analytic form as a function of $\theta$, in many other situations it can be obtained through a
numeric optimization. We now return to the examples introduced in Section 2.3.
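
To make the implicit-function-theorem step concrete, the sketch below computes $J_\theta$ and the correction factor for a scalar parameter ($p = r = 1$), where the $p \times p$ inverse and determinant reduce to a division and a square root. It is an illustration under that scalar assumption, with hypothetical function names; the user supplies the moment function `g` and its derivative `dg_dbeta`.

```python
import math


def jacobian_row(theta_free, support, beta, g, dg_dbeta):
    """J_theta = -(sum_j theta_j dg_j/dbeta)^{-1} H_beta for scalar beta.

    theta_free holds theta_1..theta_{J-1}; the last probability is implied.
    Returns the J-1 entries of the row vector d beta / d theta'.
    """
    theta = list(theta_free) + [1.0 - sum(theta_free)]
    g_vals = [g(s, beta) for s in support]
    avg_deriv = sum(t * dg_dbeta(s, beta) for t, s in zip(theta, support))
    # H_beta = (g_1, ..., g_{J-1}) - g_J * iota'
    h_row = [gj - g_vals[-1] for gj in g_vals[:-1]]
    return [-h / avg_deriv for h in h_row]


def area_factor(j_row):
    """|J J' + I_p|^{1/2}, which for p = 1 equals sqrt(1 + sum_j J_j^2)."""
    return math.sqrt(1.0 + sum(j * j for j in j_row))
```

For the mean model, `g = s - beta` and `dg_dbeta = -1`, so the row returned is $(s_j - s_J)_{j<J}$, matching Example 2; for vector-valued $\beta$ the same formula would need a matrix solve and a determinant in place of the scalar operations.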


Example 1 (Continued) The density of $\theta$ in the logistic model is

$$p(\theta) = p(\beta,\theta)\,\sqrt{1 + \left(\frac{\partial \log\{\theta/(1-\theta)\}}{\partial\theta}\right)^2}. \tag{7}$$

²We sample from the unconstrained $p(\eta)$, where $\eta_j = \log(\theta_{j+1}/\theta_j)$, for $j = 1, ..., J-1$, with $|\partial\theta/\partial\eta| = \prod_{j=1}^{J} \theta_j$.

³Jacobian correction terms also appear in reversible jump MCMC (e.g. Green (1995)), when the chain is allowed
to jump between models with different numbers of parameters. However there the (one-to-one) transformations are
operating between spaces of the same dimension, and the distributions in both spaces have densities w.r.t. Lebesgue
measure. On the other hand, the Jacobian in Proposition 1 corrects for a one-to-one mapping between spaces of
different dimensions and relates two densities that are defined w.r.t. different reference measures.

Figure 2: Projection to the marginal density for $\theta$. The blue density is the correct marginal
density $p(\theta)$, given in (7), with respect to Lebesgue measure. The grey density is the naive density
$p[\beta = \log\{\theta/(1-\theta)\}, \theta]$, which ignores the corresponding length of the support.


Thus the moment condition impacts the marginal prior on $\theta$. Figure 2 shows the function $p(\theta)$, which
has blue shade below the curve, together with the naive $p(\beta = \log\{\theta/(1-\theta)\}, \theta)$, which has grey
shade. We can see the correct density is higher for high values of $\theta$, as there are more dense values
of $\beta$ compatible with high values of $\theta$ than when $\theta$ is close to 0.5.
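
For concreteness, the Jacobian in the logistic example can be written in closed form. The following worked step uses only the definition $\beta = \mathrm{logit}(\theta)$; it is an illustration rather than an additional result.

```latex
\[
\frac{\partial}{\partial\theta}\log\frac{\theta}{1-\theta}
  = \frac{1}{\theta} + \frac{1}{1-\theta}
  = \frac{1}{\theta(1-\theta)},
\qquad\text{so}\qquad
p(\theta) = p(\beta,\theta)\,\sqrt{1 + \frac{1}{\theta^{2}(1-\theta)^{2}}}.
\]
```

The correction factor takes its smallest value, $\sqrt{17} \approx 4.12$, at $\theta = 1/2$ and grows without bound as $\theta$ approaches 0 or 1, which is why the corrected density sits above the naive one most markedly away from 0.5.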


Example 2 (Continued) The density of $\theta$ in the mean model is

$$p(\theta) = p(\beta,\theta)\,\sqrt{1 + \sum_{j=1}^{J-1} (s_j - s_J)^2} \propto p(\beta,\theta).$$

Hence in this case the geometry of the moment condition does not impact the prior on $\theta$. This will be
the case generally when the parameter space, $\Theta_{\beta,\theta}$, is flat.

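Example 3 (Continued) For the regression model write $g_j = g(s_j,\beta)$ for $1 \le j \le J$. Therefore

$$\frac{\partial g_j}{\partial\beta'} = -s_j^{(2)} s_j^{(2)\prime}, \quad \text{and} \quad \frac{\partial\beta}{\partial\theta_i} = \left(\sum_{j=1}^{J} \theta_j s_j^{(2)} s_j^{(2)\prime}\right)^{-1} (g_i - g_J).$$

Moreover

$$J_\theta J_\theta' = \left(\sum_{j=1}^{J} \theta_j s_j^{(2)} s_j^{(2)\prime}\right)^{-1} \left\{\sum_{i=1}^{J-1} (g_i - g_J)(g_i - g_J)'\right\} \left(\sum_{j=1}^{J} \theta_j s_j^{(2)} s_j^{(2)\prime}\right)^{-1}.$$

Similarly for the linear regression model with instrumental variables we have,

$$\frac{\partial g_j}{\partial\beta'} = -s_j^{(3)} s_j^{(2)\prime}, \quad \text{and} \quad \frac{\partial\beta}{\partial\theta_i} = \left(\sum_{j=1}^{J} \theta_j s_j^{(3)} s_j^{(2)\prime}\right)^{-1} (g_i - g_J),$$

and therefore

$$J_\theta J_\theta' = \left(\sum_{j=1}^{J} \theta_j s_j^{(3)} s_j^{(2)\prime}\right)^{-1} \left\{\sum_{i=1}^{J-1} (g_i - g_J)(g_i - g_J)'\right\} \left(\sum_{j=1}^{J} \theta_j s_j^{(2)} s_j^{(3)\prime}\right)^{-1}.$$

Again generalizing to nonlinear regression models is straightforward. If we define $\mu_j = \mu(s_j^{(2)},\beta)$,
then

$$\frac{\partial g_j}{\partial\beta'} = -s_j^{(2)} \frac{\partial\mu_j}{\partial\beta'},$$

which implies

$$\frac{\partial\beta}{\partial\theta_i} = \left(\sum_{j=1}^{J} \theta_j s_j^{(2)} \frac{\partial\mu_j}{\partial\beta'}\right)^{-1} (g_i - g_J),$$

$$J_\theta J_\theta' = \left(\sum_{j=1}^{J} \theta_j s_j^{(2)} \frac{\partial\mu_j}{\partial\beta'}\right)^{-1} \left\{\sum_{i=1}^{J-1} (g_i - g_J)(g_i - g_J)'\right\} \left(\sum_{j=1}^{J} \theta_j \frac{\partial\mu_j}{\partial\beta} s_j^{(2)\prime}\right)^{-1}.$$

For instance for $\mu(\beta, s^{(2)}) = \exp(\beta' s^{(2)})$ we have,

$$\frac{\partial g_j}{\partial\beta'} = -\mu_j s_j^{(2)} s_j^{(2)\prime}, \quad \frac{\partial\beta}{\partial\theta_i} = \left(\sum_{j=1}^{J} \theta_j \mu_j s_j^{(2)} s_j^{(2)\prime}\right)^{-1} (g_i - g_J),$$

and hence

$$J_\theta J_\theta' = \left(\sum_{j=1}^{J} \theta_j \mu_j s_j^{(2)} s_j^{(2)\prime}\right)^{-1} \left\{\sum_{i=1}^{J-1} (g_i - g_J)(g_i - g_J)'\right\} \left(\sum_{j=1}^{J} \theta_j \mu_j s_j^{(2)} s_j^{(2)\prime}\right)^{-1}.$$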


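Example 4 (Continued) For the causal inference problem write $g_j = g(s_j,\beta)$, for $1 \le j \le J$.
Then, applying (6) with $\beta = (\gamma', \tau)'$,

$$J_\theta J_\theta' = A^{-1} \left\{\sum_{i=1}^{J-1} (g_i - g_J)(g_i - g_J)'\right\} (A')^{-1}, \quad A = \sum_{j=1}^{J} \theta_j \frac{\partial g_j}{\partial\beta'},$$

where each $\partial g_j/\partial\beta'$ is a $(K+1) \times (K+1)$ matrix whose last column is $(0_{1 \times K}, -1)'$, because $\tau$
enters only the final moment condition and does so linearly.

An immediate consequence of Proposition 1 is that if we reparametrize the scientific parameters
of interest $\psi = \psi(\beta)$ using a one to one transform, then

$$p(\psi,\theta) = \frac{|J_\theta J_\theta' + I_p|^{1/2}}{\left|\frac{\partial\psi}{\partial\beta'} J_\theta J_\theta' \frac{\partial\psi'}{\partial\beta} + I_p\right|^{1/2}}\; p(\beta,\theta), \tag{8}$$

where $p(\psi,\theta)$ and $p(\beta,\theta)$ are densities with respect to Hausdorff measures.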


3.4 Joint method 


Alternatively, we may draw random samples directly from the posterior of $(\beta,\theta)$. This distribution is
supported on a zero Lebesgue measure set, $\Theta_{\beta,\theta}$, with the density function (with respect to Hausdorff
measure) $p(\beta,\theta)$. If we ignore this and propose moves from a continuous proposal distribution in
$\mathbb{R}^{J+p-1}$ (for instance a Gaussian proposal), the proposed moves are off the support of $p(\beta,\theta)$ almost
surely, and they will be rejected with probability one. Therefore in order to sample from $p(\beta,\theta)$ we
must find a proposal distribution that assigns positive probability to $\Theta_{\beta,\theta}$. Drawing random samples
from this proposal should be easy and fast and (in order to compute the acceptance probability)
we should be able to evaluate its density function. This subsection will explain how this can be
achieved.

For a given value of $\beta$, the moment conditions imply the affine constraints on $\theta$:

$$H_\beta \theta + g_J = 0. \tag{9}$$

Therefore $\Theta_{\theta|\beta}$ lies on a hyperplane in $\mathbb{R}^{J-1}$. This property allows us to define a suitable
proposal distribution for $(\beta,\theta)$. Assume the current state of the MCMC is $(\beta^{(t)}, \theta^{(t)})$. First we
explain how a random sample from the proposal can be drawn, and then will show how the density
of this proposal can be evaluated. In order to draw a random sample from $q(\cdot, \cdot \mid \beta^{(t)}, \theta^{(t)})$,

1. Draw $\beta^* \mid \beta^{(t)}, \theta^{(t)}$ from an (almost) arbitrary proposal $q(\cdot \mid \beta^{(t)}, \theta^{(t)})$.

2. Draw $\theta^*$ from a singular distribution supported on the hyperplane $V^* = \{x \in \mathbb{R}^{J-1} : H_{\beta^*} x + g_J^* = 0\}$.
We denote the density of this distribution (with respect to the Hausdorff measure)
by $q(\cdot \mid \beta^{(t)}, \theta^{(t)}, \beta^*)$. Moreover we assume the density can be easily evaluated at any $\theta^*$. A
singular Normal distribution supported on $V^*$ is one suitable choice (see Khatri (1968)).

In Appendix A.3 we provide a way to determine the parameters of a singular Normal
distribution that can be used to propose for $\theta^* \mid \beta^{(t)}, \theta^{(t)}, \beta^*$.
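
The second step of the proposal can be sketched for the scalar case $r = 1$, where the constraint is a single linear equation $h'x + c = 0$. The sketch is an assumption-laden illustration and not the construction of Appendix A.3: it perturbs the current $\theta$ with isotropic Gaussian noise and projects the result orthogonally onto the hyperplane, which yields one particular singular Normal supported on $V^*$. The function names and the step size argument are hypothetical.

```python
import random


def project_to_hyperplane(x, h, c):
    """Orthogonal projection of x onto the hyperplane {y : h'y + c = 0}."""
    resid = sum(hi * xi for hi, xi in zip(h, x)) + c
    hh = sum(hi * hi for hi in h)
    return [xi - hi * resid / hh for hi, xi in zip(h, x)]


def propose_theta(theta_current, h_star, g_last_star, step, rng):
    """Draw N(theta_current, step^2 I) and project onto {x : h_star'x + g_last_star = 0}.

    h_star is the row H_{beta*} and g_last_star is g(s_J, beta*), both evaluated at
    the proposed beta*. The result satisfies the moment condition exactly but may
    violate the probability axioms, in which case the move would be rejected.
    """
    noisy = [t + rng.gauss(0.0, step) for t in theta_current]
    return project_to_hyperplane(noisy, h_star, g_last_star)
```

For example, in the mean model with support $(1, 2, 4)$ and proposed $\beta^* = 2.5$, the inputs are `h_star = [-3.0, -2.0]` and `g_last_star = 1.5`, and every draw returned implies a mean of exactly 2.5.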
So far we have shown how a random proposal can be generated from $q(\cdot, \cdot \mid \beta^{(t)}, \theta^{(t)})$. The
following proposition demonstrates how the density of this proposal can be evaluated when $p = r$.


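Proposition 2 Let $p(\beta,\theta)$ be the density of $(\beta,\theta)$ with respect to $(J-1)$-dimensional Hausdorff
measure on $\Theta_{\beta,\theta}$. Moreover assume the density of $\beta$ with respect to Lebesgue measure is $p(\beta)$, and
the density of $\theta \mid \beta$ with respect to Hausdorff measure is $p(\theta \mid \beta)$ on $\Theta_{\theta|\beta}$, where $\Theta_{\theta|\beta}$ is a hyperplane.
Then

$$p(\beta,\theta) = \frac{1}{|J_\theta J_\theta' + I_p|^{1/2}}\; p(\beta)\, p(\theta \mid \beta), \quad \text{where} \quad J_\theta = \frac{\partial\beta}{\partial\theta'}. \tag{10}$$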

The proposed pairs $(\beta^*, \theta^*)$ satisfy the moment conditions, however the probabilities may not
satisfy the probability axioms (as some of $\theta^*$ may be negative or $\theta_J^* = 1 - \iota'\theta^* < 0$). Obviously in
these cases the proposal is rejected (since the posterior is zero), the chain stays at its current state, and the
proposal's density need not be evaluated. If the proposal is valid, then the move $(\beta,\theta) \to (\beta^*,\theta^*)$
is accepted with probability

$$\min\left\{1,\ \frac{p(\beta^*,\theta^* \mid Z)}{p(\beta,\theta \mid Z)}\, \frac{q(\beta,\theta \mid \beta^*,\theta^*)}{q(\beta^*,\theta^* \mid \beta,\theta)}\right\}. \tag{11}$$
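
The accept or reject decision can be sketched on the log scale. This is a generic Metropolis-Hastings step written under assumptions: the function name and argument names are hypothetical, and the caller is assumed to have already evaluated the log posterior kernel and the log proposal densities (including any Jacobian terms) at both points.

```python
import math
import random


def mh_accept(log_post_new, log_post_old, log_q_reverse, log_q_forward, rng):
    """Return True if the proposed move should be accepted.

    log_post_new / log_post_old: log posterior kernel at the proposed / current point.
    log_q_reverse: log q(current | proposed); log_q_forward: log q(proposed | current).
    A proposal with zero posterior density (log_post_new == -inf) is always rejected.
    """
    if log_post_new == -math.inf:
        return False
    log_ratio = (log_post_new + log_q_reverse) - (log_post_old + log_q_forward)
    accept_prob = math.exp(min(0.0, log_ratio))  # equals min(1, ratio), never overflows
    return rng.random() < accept_prob
```

Capping the log ratio at zero before exponentiating keeps the computation stable when the posterior kernels differ by many orders of magnitude.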


The terms inside this acceptance probability are straightforward to compute up to proportionality. 

Note that in the joint method we do not need to solve for $\beta$ in each iteration of the simulation, because our proposed moves are elements of the parameter space $\Theta_{\beta,\theta}$. Moreover, when $J$ goes to infinity, the Jacobian term in (10) converges to 1. To see this assume the data generating process is a continuous distribution or a discrete distribution with infinite support, $s_j \sim H$. Then, with probability one, by a strong law of large numbers, $J^{-1} J_\theta J_\theta'$ converges to a finite positive definite matrix determined by the expectations of $\partial g/\partial \beta'$ and $g g'$ under $H$. Therefore

$$\frac{|J_\theta J_\theta'|}{|J_\theta J_\theta' + I_p|} = \left|I_p + (J_\theta J_\theta')^{-1}\right|^{-1} \to 1$$

with probability one as $J$ goes to infinity. This asymptotic approximation could be used to simplify the computation of the acceptance probability, but otherwise does not change the substance of this section, as proposals will be made in the same way, directly on the manifold.
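
The convergence can be illustrated numerically for the scalar mean, where $\beta = \sum_j \theta_j s_j$ and hence $J_\theta = (s_1 - s_J, \ldots, s_{J-1} - s_J)$. The sketch below assumes the Jacobian term has the ratio form $|J_\theta J_\theta'|^{1/2} / |J_\theta J_\theta' + 1|^{1/2}$; the function name and the Gaussian support are illustrative choices only.

```python
import random

def jacobian_term(support):
    """Jacobian ratio for the scalar mean, with J_theta = (s_j - s_J)."""
    s_last = support[-1]
    jj = sum((s - s_last) ** 2 for s in support[:-1])  # J_theta J_theta'
    return (jj / (jj + 1.0)) ** 0.5

rng = random.Random(0)
for size in (10, 100, 1000, 10000):
    support = [rng.gauss(0.0, 1.0) for _ in range(size)]
    print(size, jacobian_term(support))
```

Because $J_\theta J_\theta'$ is a sum of $J - 1$ squared differences, it grows roughly linearly in $J$, which drives the ratio towards one.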
3.5 Relationship to the Bayesian bootstrap

The Rubin "Bayesian bootstrap" is at the core of Chamberlain and Imbens (2003). We can implement our Proposition by using their Bayesian bootstrap as a proposal which can be reweighted to allow for informative priors on $\beta$. Throughout we assume $\beta$ can be solved given $\theta$.

Our generalization of Chamberlain and Imbens (2003) starts with the Dirichlet prior $\pi^*(\theta) \propto \prod_{j=1}^{J} \theta_j^{\alpha-1}$, $\alpha > 0$. The Bayesian bootstrap then simulates from the proposal density,

$$g(\theta|Z) \propto \prod_{j=1}^{J} \theta_j^{n_j+\alpha-1}, \tag{12}$$

where $n_j$ is the number of observations equal to $s_j$. We assume the researcher does this $M$ times, writing the draws as $\{\theta^{(k)}\}_{k=1,2,\ldots,M}$. For each $\theta^{(k)}$ we assume there is a unique $\beta^{(k)}$ which solves the corresponding moment conditions. Chamberlain and Imbens (2003) stop at this point, using this sample as a Monte Carlo estimate of the posterior.


Correcting for the geometry of the problem, the actual posterior is

$$p(\theta|Z) \propto p(\beta,\theta)\, |J_\theta J_\theta' + I_p|^{1/2} \prod_{j=1}^{J} \theta_j^{n_j}. \tag{13}$$


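The resulting weights, given by the true posterior density with respect to the Lebesgue measure divided by the proposal density, are

$$w^{(k)} \propto \frac{p(\beta^{(k)},\theta^{(k)})\, |J_\theta^{(k)} J_\theta^{(k)\prime} + I_p|^{1/2}}{\prod_{j=1}^{J} \left(\theta_j^{(k)}\right)^{\alpha-1}}, \quad k = 1, 2, \ldots, M, \tag{14}$$

(where $J_\theta^{(k)}$ is equal to $J_\theta$ evaluated at $\theta^{(k)}$), which normalize as $w^{*(k)} = w^{(k)} / \sum_{l=1}^{M} w^{(l)}$.

An encouraging aspect of this weight is that it does not depend on the data.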

In the special case where $p(\beta,\theta) \propto \pi(\beta)\pi^*(\theta)$, the weights may be simply evaluated as

$$w^{(k)} \propto \pi(\beta^{(k)})\, |J_\theta^{(k)} J_\theta^{(k)\prime} + I_p|^{1/2}, \quad k = 1, 2, \ldots, M. \tag{15}$$


We can use these weights to estimate $E(h(\beta)|Z) \simeq \sum_{k=1}^{M} w^{*(k)} h(\beta^{(k)})$. This is importance sampling, e.g. Marshall (1956), Geweke (1989), Liu (2001). An alternative is to resample with probability proportional to the weight, which delivers sampling importance resampling (SIR, see Rubin (1988)). As with all importance samplers, the weights may become uneven, although the simplicity of the structure of the weights is encouraging. This sampling strategy becomes appealing in models where $\beta$ can be computed easily for any $\theta$, and the prior distribution of $\beta$ is not too far from the posterior obtained from the Bayesian bootstrap.
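
The reweighting can be sketched for the scalar mean, where $\beta^{(k)} = \sum_j \theta_j^{(k)} s_j$ is available in closed form and $J_\theta$ does not depend on $\theta$, so the weights reduce to the prior on $\beta$. The function names, the Laplace prior and the data below are hypothetical and chosen only for illustration.

```python
import math
import random

def dirichlet_draw(params, rng):
    """One Dirichlet draw obtained by normalising independent gamma variates."""
    gammas = [rng.gammavariate(a, 1.0) for a in params]
    total = sum(gammas)
    return [g / total for g in gammas]

def reweighted_bootstrap(support, counts, alpha, log_prior_beta, n_draws, rng):
    """Bayesian bootstrap draws of the mean with normalised importance weights."""
    params = [n + alpha for n in counts]  # proposal kernel theta_j^(n_j+alpha-1)
    betas, log_w = [], []
    for _ in range(n_draws):
        theta = dirichlet_draw(params, rng)
        beta = sum(t * s for t, s in zip(theta, support))  # solves the moment condition
        betas.append(beta)
        # For the mean the Jacobian is constant, so the weight is the prior on beta.
        log_w.append(log_prior_beta(beta))
    top = max(log_w)
    w = [math.exp(v - top) for v in log_w]
    total = sum(w)
    return betas, [v / total for v in w]

rng = random.Random(1)
betas, weights = reweighted_bootstrap(
    [-1.0, 0.0, 1.0], [3, 4, 3], 0.5,
    lambda b: -2.0 * abs(b - 0.5),  # hypothetical Laplace log-prior centred at 0.5
    2000, rng)
print(sum(w * b for w, b in zip(weights, betas)))  # estimate of E(beta | Z)
```

In models where $J_\theta$ varies with $\theta$, the determinant term would have to be added to the log weight.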
3.6 Missing support

So far we have assumed the support of the data is known. Here we extend this to assume the support has $J^* > J$ elements, $S^* = (s_1, \ldots, s_{J^*})$, where its first $\tilde{J} = J^* - J$ elements, $\tilde{S} = (s_1, \ldots, s_{\tilde{J}})$, have not been observed in the sample, while the rest of its elements $S = (s_{\tilde{J}+1}, \ldots, s_{J^*})$ have been observed at least once. Moreover let $\theta^* = (\tilde{\theta}, \theta)$, where $\tilde{\theta}$ and $\theta$ are the vectors of the probabilities of the elements of $\tilde{S}$ and $(s_{\tilde{J}+1}, \ldots, s_{J^*-1})$, and define $\theta^*_{J^*} = 1 - \sum_{j=1}^{J^*-1} \theta^*_j$. We assume the missing elements of the support are i.i.d. draws from $F_S$, $s_j \sim F_S$ for $j = 1, \ldots, \tilde{J}$, with density $f_S$ with respect to Lebesgue measure. The moment conditions are then

$$\sum_{j=1}^{J^*} \theta^*_j\, g(s_j, \beta) = 0,$$

while the posterior is

$$p(\beta,\theta^*,\tilde{S}|Z) \propto p(\beta,\theta^*,\tilde{S}) \prod_{j=1}^{J^*} (\theta^*_j)^{n_j}, \tag{16}$$

where $n_j = \sum_{i} 1(Z_i = s_j)$. Note that $n_j = 0$ for $1 \le j \le \tilde{J}$, while $n_j$ is positive for $\tilde{J}+1 \le j \le J^*$.

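Assume the researcher expresses a prior on $(\beta,\theta^*)|\tilde{S}$ with respect to the Hausdorff measure (suppressing the conditioning on $S$ for notational convenience),

$$p(\beta,\theta^*|\tilde{S}). \tag{17}$$

Therefore

$$p(\beta,\theta^*,\tilde{S}) = p(\beta,\theta^*|\tilde{S}) \prod_{j=1}^{\tilde{J}} f_S(s_j). \tag{18}$$

Given $\theta^*$ and $\tilde{S}$ (and $S$), $\beta$ is uniquely determined. Therefore the core result we need to do inference is a generalization of Proposition: the density of the probabilities and the missing support with respect to the Lebesgue measure is

$$p(\theta^*,\tilde{S}) = \left|J_{\theta^*} J_{\theta^*}' + J_{\tilde{S}} J_{\tilde{S}}' + I_p\right|^{1/2} \prod_{j=1}^{\tilde{J}} f_S(s_j)\; p(\beta,\theta^*|\tilde{S}), \tag{19}$$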


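where

$$J_{\theta^*} = \frac{\partial \beta}{\partial \theta^{*\prime}}, \qquad J_{\tilde{S}} = \frac{\partial \beta}{\partial \tilde{S}'}. \tag{20}$$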
Again this result follows from the area formula. Proposition 2 generalizes in the same way, delivering

$$p(\beta,\theta^*,\tilde{S}) = \frac{\left|J_{\theta^*} J_{\theta^*}' + J_{\tilde{S}} J_{\tilde{S}}'\right|^{1/2}}{\left|J_{\theta^*} J_{\theta^*}' + J_{\tilde{S}} J_{\tilde{S}}' + I_p\right|^{1/2}}\; p(\beta)\, p(\theta^*|\beta,\tilde{S}) \prod_{j=1}^{\tilde{J}} f_S(s_j). \tag{21}$$

Again the Jacobian will be close to one if $J^*$ is large. The ratio of $J$ to $J^*$ does not make any difference to this approximation.

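Example 2 (Continued) Now add a single point of missing support. Then $J^* = 4$, $\theta^* = (\theta_1, \theta_2, \theta_3)'$, $J = 3$ and $S^* = \{s_1, s_2, s_3, s_4\} = \{-1, 0, 1, s_4\}$. Then $\beta = \theta_3 - \theta_1 + s_4\theta_4 = \theta_3 - \theta_1 + s_4(1 - \theta_1 - \theta_2 - \theta_3)$. For this model

$$J_{\theta^*} = \frac{\partial \beta}{\partial \theta^{*\prime}} = (-1 - s_4,\ -s_4,\ 1 - s_4) \quad \text{and} \quad J_{\tilde{S}} = \frac{\partial \beta}{\partial s_4} = \theta_4, \tag{22}$$

and so $J_{\theta^*} J_{\theta^*}' = 2 + 3s_4^2$ and $J_{\tilde{S}} J_{\tilde{S}}' = \theta_4^2$. Hence, writing $\theta_4 = 1 - \theta_1 - \theta_2 - \theta_3$,

$$p(s_4,\theta^*) = \sqrt{(2 + 3s_4^2) + \theta_4^2 + 1}\; f_S(s_4)\, p(\beta,\theta^*|s_4). \tag{23}$$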
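
The Jacobians in this example are easy to check numerically. The sketch below codes $\beta$ as a function of $(\theta_1, \theta_2, \theta_3, s_4)$, compares the analytic derivatives with central finite differences, and evaluates the square-root factor appearing in the density of $(s_4, \theta^*)$. All function names are illustrative.

```python
def beta(theta, s4):
    """Mean on the support {-1, 0, 1, s4}; theta holds the first three probabilities."""
    t1, t2, t3 = theta
    t4 = 1.0 - t1 - t2 - t3
    return t3 - t1 + s4 * t4

def jacobians(theta, s4):
    """Analytic derivatives of beta with respect to theta* and s4."""
    theta4 = 1.0 - sum(theta)
    return (-1.0 - s4, -s4, 1.0 - s4), theta4

def area_factor(theta, s4):
    """Square root of J_theta* J_theta*' + J_S J_S' + 1."""
    j_theta, j_s = jacobians(theta, s4)
    return (sum(x * x for x in j_theta) + j_s ** 2 + 1.0) ** 0.5

def finite_difference(theta, s4, h=1e-6):
    """Central finite-difference derivatives, used only as a check."""
    grads = []
    for i in range(3):
        up, down = list(theta), list(theta)
        up[i] += h
        down[i] -= h
        grads.append((beta(up, s4) - beta(down, s4)) / (2.0 * h))
    d_s = (beta(theta, s4 + h) - beta(theta, s4 - h)) / (2.0 * h)
    return tuple(grads), d_s

print(jacobians([0.3, 0.3, 0.3], 0.7), finite_difference([0.3, 0.3, 0.3], 0.7))
```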


4 Some potential priors 


So far we have discussed working with any prior $p(\beta,\theta)$ which is defined with respect to lower dimensional Hausdorff measure supported on $\Theta_{\beta,\theta}$. In this section we discuss potential ways of selecting $p(\beta,\theta)$. As with all prior selection there is no uniquely good way of carrying this out.

4.1 A non-science prior 

From a nonparametric standpoint it is natural to build a prior from $p(\theta)$, e.g. Dirichlet. Then Proposition implies there is a unique joint prior

$$p(\beta,\theta) = \frac{p(\theta)}{\sqrt{|J_\theta J_\theta' + I_p|}} \tag{24}$$

which achieves this. On the right hand side $p(\theta)$ is the density of $\theta$ with respect to Lebesgue measure, while $p(\beta,\theta)$ is the density of $(\beta,\theta)$ with respect to Hausdorff measure. This implies

$$\Pr\{(\beta,\theta) \in C\} = \int_{C_\theta} p(\theta)\, d\theta. \tag{25}$$

The Dirichlet special case of (24) is the implicit Chamberlain and Imbens (2003) prior on $p(\beta,\theta)$.


4.2 A prior on the science 

Proposition 2 says that

$$p(\beta,\theta) = \frac{|J_\theta J_\theta'|^{1/2}}{|J_\theta J_\theta' + I_p|^{1/2}}\, p(\beta)\, p(\theta|\beta). \tag{26}$$

If we place a prior on the science $p(\beta)$ with respect to the Lebesgue measure, then we can form a scientifically centered prior on $p(\beta,\theta)$ by specifying a prior on $p(\theta|\beta)$ with respect to the $(J - 1 - p)$-dimensional Hausdorff measure. This prior sits on the hyperplane $\theta|\beta$ satisfying the linear moment constraints and the probability axioms. One such prior is Dirichlet subject to the constraints.

Again if $J$ gets large the Jacobian in (26) will become unimportant in practice.


4.3 Ad hoc priors

A more brutal approach to building a prior is to define an "initial" prior (with respect to Lebesgue measure) for $\beta$ and $\theta$ which ignores the moment condition, $\eta(\beta,\theta)$, where the implied initial marginal prior on $\beta$, $\eta(\beta)$, could be our substantive initial prior. From the Borel paradox (Kolmogorov (1956)) we know there are many ways of building a $p(\beta,\theta)$ from $\eta(\beta,\theta)$ (conditioning on satisfying the moment condition is not enough), but here we discuss various plausible methods.


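This line of thinking leads to a generalization of (24), setting

$$p(\beta,\theta) \propto \frac{\eta(\beta,\theta)}{\sqrt{|J_\theta J_\theta' + I_p|}}\, 1_{\Theta_{\beta,\theta}}(\beta,\theta). \tag{27}$$

This prior scales the initial prior to counteract the length of the curve mapping out the relationship between $\theta$ and $\beta$ implied by the moment condition. This prior has the property that $p(\theta) \propto \eta(\beta,\theta)\, 1_{\Theta_{\beta,\theta}}(\beta,\theta)$, with respect to the Lebesgue measure.

The simple case of $\eta(\beta,\theta) = \eta(\beta)\eta(\theta)$ would imply under (27)

$$p(\theta) \propto \eta(\beta)\,\eta(\theta)\, 1_{\Theta_{\beta,\theta}}(\beta,\theta). \tag{28}$$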


The case where $\eta(\theta)$ is Dirichlet is important. Then, under (27), the Bayesian bootstrap weights (14) would become the rather simple

$$w_j \propto \eta(\beta^{(j)}), \quad j = 1, 2, \ldots, M. \tag{29}$$

This is a minimally informative generalization of Chamberlain and Imbens (2003).


An alternative to (27) is to put no mass on inadmissible combinations of $\beta$, $\theta$. We call this the "truncated prior"

$$p(\beta,\theta) \propto \eta(\beta,\theta)\, 1_{\Theta_{\beta,\theta}}(\beta,\theta) \tag{30}$$
Figure 3: The parameter space (the blue curve) and the initial prior $\pi(\beta,\theta)$ have been depicted. Figure shows the implied $p(\beta,\theta)$.


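in which $p(\beta,\theta)$ is the density of the prior with respect to the $(J-1)$-dimensional Hausdorff measure in $\Theta_{\beta,\theta}$. This would imply for any measurable set $C \subseteq \Theta_{\beta,\theta}$

$$\Pr\{(\beta,\theta) \in C\} = \int_{C_\theta} p(\beta,\theta)\sqrt{|J_\theta J_\theta' + I_p|}\, d\theta \propto \int_{C_\theta} \eta(\beta,\theta)\sqrt{|J_\theta J_\theta' + I_p|}\, d\theta. \tag{31}$$

It implies $p(\theta) \propto \eta(\beta,\theta)\sqrt{|J_\theta J_\theta' + I_p|}\; 1_{\Theta_{\beta,\theta}}(\beta,\theta)$, with respect to the Lebesgue measure.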
Example 5 (continuing the logistic Example). Assume the initial prior

rji(3, 9) oc 9^-^^-\l - (32) 


which is a relatively ignorant Dirichlet prior on the probabilities and an informative Gaussian prior for $\beta$ centered on one. This is depicted in Figure 3. With this initial prior and using the class of priors (30), the density with respect to the univariate Hausdorff measure is

$$p(\beta,\theta) \propto \eta(\beta,\theta)\, 1_{\Theta_{\beta,\theta}}(\beta,\theta). \tag{33}$$

The accompanying figure shows the corresponding $p(\beta,\theta)$ living on the manifold. In this case, with $J_\theta = \partial\beta/\partial\theta$ scalar,

$$p(\theta) \propto \eta(\beta,\theta)\sqrt{1 + J_\theta^2} \tag{34}$$

with respect to the Lebesgue measure. With the alternative prior (27),

$$p(\beta,\theta) \propto \frac{\eta(\beta,\theta)}{\sqrt{1 + J_\theta^2}}\, 1_{\Theta_{\beta,\theta}}(\beta,\theta), \quad \text{and} \quad p(\theta) \propto \eta(\beta,\theta)\, 1_{\Theta_{\beta,\theta}}(\beta,\theta). \tag{35}$$


5 Illustrative examples 


In this section we present some illustrative examples and simulation studies. Since the MCMC 
results obtained by the marginal and joint methods are indistinguishable we present only one of 
them. At the end of the section we study how the methods scale. 

5.1 The mean 

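Recall the inference on the mean studied in the earlier example. Now focus on $J = 3$ and $S = (-1, 0, 1)$, so $\beta = \theta_3 - \theta_1 = 1 - 2\theta_1 - \theta_2$. Here we have taken the 2 dimensional Hausdorff prior as

$$p(\beta,\theta) \propto e^{-2|\beta - m|}\, \theta_1^{\alpha-1}\theta_2^{\alpha-1}(1 - \theta_1 - \theta_2)^{\alpha-1}\; 1(\min\{\theta_1, \theta_2, 1 - \theta_1 - \theta_2\} > 0). \tag{36}$$

We call this a "Laplace-Dirichlet" distribution on $(\beta,\theta)$, where $\beta$ is centered around $m$ and the Dirichlet part is indexed by $\alpha$.

By the marginal method:

$$p(\theta) \propto e^{-2|1 - 2\theta_1 - \theta_2 - m|}\, \theta_1^{\alpha-1}\theta_2^{\alpha-1}(1 - \theta_1 - \theta_2)^{\alpha-1}\; 1(\min\{\theta_1, \theta_2, 1 - \theta_1 - \theta_2\} > 0). \tag{37}$$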

Figure 4 shows the contours of $p(\theta)$ for various values of $m$ and $\alpha$. We have plotted these contours against $(\theta_1, \theta_2, \theta_3)'$ so the reader can compare $\theta_1$ and $\theta_3$.

If the Laplace-Dirichlet distribution has $m = 0$ then the density is symmetric with respect to $\theta_1$ and $\theta_3$. When the location parameter of $p(\beta)$ is positive, $\theta_1$ is on average smaller than $\theta_3$. Moreover as $\alpha$ increases, the variability of $p(\theta)$ decreases.

Figure 5 draws the prior for $\beta$. Although the Laplace component is defined on the whole real line, once the support of the data is observed the prior of $\beta$ is restricted to $[-1, 1]$. As $\alpha$ increases the variance of $\beta$ decreases. For instance, a prior for $\beta$ centered at a positive value results in a prior for $\theta$ tilted toward $\theta_3$, even if the initial prior of $\theta$ is symmetric. In the same way, a more informative initial prior for $\theta$ yields a more peaked prior for $\beta$.
[Figure 4 panels: $\alpha = 0.01$, $m = 0.5$; $\alpha = 0.5$, $m = 0.5$; $\alpha = 5$, $m = 0.5$; $\alpha = 10$, $m = 0.5$]

Figure 4: Equiprobability contours implied by the Laplace-Dirichlet prior on $p(\beta,\theta)$ with respect to the Hausdorff measure. Plotted is the marginal $p(\theta_1, \theta_3)$ for several values of $m$ and $\alpha$, with $\theta_2$ implied as $\theta_2 = 1 - \theta_1 - \theta_3$. This case has $J = 3$ points of support ($s_1 = -1$, $s_2 = 0$, $s_3 = 1$) and $r = 1$ moment constraints (the mean).



Figure 5: Illustrating Example 1 (estimating the mean). Plot of $p(\beta)$ for several values of $m$ and $\alpha$. This case has $J = 3$ points of support ($s_1 = -1$, $s_2 = 0$, $s_3 = 1$) and $r = 1$ moment constraints (the mean). Initial prior for $\beta$ is Laplace centered at $m$ and the initial prior for $\theta$ is symmetric Dirichlet with parameter $\alpha$.
Figure 6: Illustrating Example 1 with the partially observed support: inferring the mean $\beta$. The prior and the posterior of the missing element of the support $s_4$ (left panel) and mean $\beta$ (right panel).


5.2 Missing support and the mean 

In the previous section, the finite support of $\beta$ is caused by the known support of the data. We now extend this to cover the Example with missing support, where we have a single missing datapoint

$$s_4 \sim N(0, 10^2), \tag{38}$$

and all other features of the problem are unchanged. An adaptive MH algorithm has been used in order to draw 10,000 samples from the joint distribution. For the sake of brevity we present the results only for the case of $m = 0$ and $\alpha = 0.5$.

The means and standard errors of the probabilities $\theta^*$ are (0.3065, 0.3335, 0.3079, 0.0521) and (0.0026, 0.0030, 0.0024, 0.0010), respectively. The left panel of Figure 6 shows the initial prior (38) and the implied marginal distribution of the missing element of the support $s_4$ from the joint prior. The variance of the implied marginal is smaller than the prior's variance, because the prior distribution of $\beta$ is informative about the support of the data. The right panel of Figure 6 shows the Laplace element of the prior and the full marginal prior for $\beta$. The full marginal prior is not the same as the Laplace distribution due to the informative priors on the probabilities.


5.3 Linear regression 

Recall the linear regression of Example 3. Assume the observed data is $Z = \{(1,1), (2,4), (3,9)\}$. Earlier we have seen that the parameter space, $\Theta_{\beta,\theta}$, is a non-flat surface in $\mathbb{R}^3$. Figure 7 demonstrates the posterior distribution of the parameters defined on this surface (the prior parameters are $\alpha = 0.5$ and $m = 3$).

Following the suggested MCMC simulation algorithms we draw 100,000 samples from the posterior distribution of the parameters. In Figure 8 we have drawn the contour plots of the posterior distribution of the probabilities. Analytical results have been compared with the estimates obtained by a kernel density estimator using the MCMC draws.

Figure 7: The posterior distribution of the linear regression model with data $Z = \{(1,1), (2,4), (3,9)\}$. The prior parameters are $\alpha = 0.5$ and $m = 3$.


5.4 Simulation study 

To demonstrate the scalability of the algorithms we consider a linear regression model with sample size $J = 500$. The data $Z_j = (Y_j, X_j)$, for $1 \le j \le J$, is generated according to $X_j \sim N(1, 2^2)$, $Y_j|X_j \sim N(2 + 5X_j, 10^2)$. We assume the substantive prior of $\beta$ is $\beta \sim N(\mu_0, \Sigma_0)$, where the elements of $\mu_0$ are equal to the 25% quantiles of the asymptotic MLE estimators, and $\Sigma_0$ is equal to the asymptotic variance of the MLE estimator multiplied by 100 (see Appendix A.5 for the results with a different prior). The initial prior of $\theta$ is a symmetric Dirichlet distribution with parameter $\alpha = 0.01$. We have drawn 50,000 samples from the posterior after a 5,000 sample burn-in (the chain's trace has been thinned by a factor of 100, so the chain has been iterated 5,000,000 times). The scatter plot of the sample is depicted in the top-left panel of Figure 9. Each circle represents a data point in our sample and its radius is proportional to the expected value of its posterior probability, i.e. $E(\theta_j|Z)$. In the top-right panel the correlograms (ACFs) of the chains of $\beta$ and 10 elements of $\theta$ are presented (the red dashed lines and the blue dotted lines correspond to $\beta$ and $\theta$, respectively). The ACFs demonstrate that the Markov chain is mixing sufficiently well. In the bottom-left panel the contour plot of the posterior distribution of $\beta$ has been compared to the one obtained by the Bayesian bootstrapping of Chamberlain and Imbens (2003). The posterior distributions are very close, because the prior's information is roughly 1% of the information content of the sample. The bottom-right panel shows a histogram of the samples from the posterior distribution of $\beta$.

Figure 8: The posterior distribution of $\theta$ in the linear regression model with data $Z = \{(1,1), (2,4), (3,9)\}$ (analytical results and the estimates obtained by a kernel density estimator using 100,000 MCMC draws). The prior parameters are $\alpha = 0.5$ and $m = 3$.


6 Empirical studies 


In this section we study two empirical examples. The first focuses on an instrumental variable 
based estimator, the second looks at estimating the average treatment effect from an experiment. 

6.1 Instrumental variables 


In this section we demonstrate the applicability and scalability of the methodology developed in this paper to a real dataset. We use a subsample of the earnings and schooling dataset studied in Chamberlain and Imbens (2003). This dataset is a subset of the data studied in Angrist and Krueger (1991) and consists of the self-reported weekly log-earnings (self-reported annual earnings divided by 52) of 162,512 male subjects who reported positive annual wages in 1979, along with their number of years of education and their quarter of birth date. In turn this is a 5% random sample from the 1980 Public Use Census Data. Bound et al. (2001) discuss the myriad problems of self-reported income data but we do not address that issue here. For example, Britton et al. (2015) compared UK self-reported income with tax based administrative data, finding that high income earners significantly under-reported their incomes compared to those seen in administrative data.

Figure 9: Inference in linear regression model with $J = 500$. Top left shows circles whose radius is the posterior expectation of the probabilities given the data: $E(\theta_j|Z)$. Top right is the correlogram for the thinned draws of the elements of $\beta$ and ten elements of $\theta$. The bottom graphs show the estimated contour and marginal densities of the resulting posterior.


Chamberlain and Imbens (2003) studied the dependence of earnings on the level of schooling using a linear additive treatment effect model (e.g. Imbens and Rubin (2015)). They model schooling levels as being determined by rational agents' optimization of their lifetime expected utility. Since the utility is a function of the earnings they needed to estimate the distribution of earnings as a function of the schooling level.

The expected log-earnings $Y_X$ with schooling level $X$ is modeled here as $E(Y_X|X, Y_0) = Y_0 + \beta_1 X$, where $X$ is the schooling level, $\beta_1$ is the unknown return to education, and $Y_0$ is the earnings level with no schooling at all. Let $\beta_0$ be the expected value of $Y_0$, so $Y_0 - \beta_0$ has a zero mean.

In order to estimate the unknown parameters, $\beta = (\beta_0, \beta_1)'$, we follow Angrist and Krueger (1991) and Chamberlain and Imbens (2003) and use an instrumental variable (IV) $W$ that is a binary indicator: $W = 0$ if the subject was born in the first three quarters of the year and $W = 1$ otherwise. The instrumental variable $W$ is correlated with the regressor $X$ and thought by the researchers to be uncorrelated with the errors.

We obtain the classic IV estimate of $\beta$ using the full sample, and treat it as the "true" value of $\beta$. Then we draw random samples with replacement of size $J$ from the original data 1,000 times. Our aim will be to compare different estimators using these smaller samples.
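
For a just-identified model with a constant and a single instrument, the classic IV estimate solves the sample moment conditions $\sum_j (Y_j - \beta_0 - \beta_1 X_j) = 0$ and $\sum_j W_j (Y_j - \beta_0 - \beta_1 X_j) = 0$ in closed form. The sketch below is a generic implementation of that textbook formula, not the authors' code; the function name and toy data are hypothetical.

```python
def iv_estimate(y, x, w):
    """Just-identified IV estimate of (beta0, beta1) with instrument w.

    Raises ZeroDivisionError if the instrument is uncorrelated with x in the
    sample, e.g. when a small resample contains a single value of w.
    """
    n = len(y)
    y_bar, x_bar, w_bar = sum(y) / n, sum(x) / n, sum(w) / n
    cov_wy = sum((wi - w_bar) * (yi - y_bar) for wi, yi in zip(w, y))
    cov_wx = sum((wi - w_bar) * (xi - x_bar) for wi, xi in zip(w, x))
    beta1 = cov_wy / cov_wx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1

# Toy data in which y is an exact linear function of x.
x = [10.0, 12.0, 11.0, 14.0]
w = [0, 0, 1, 1]
y = [5.0 + 0.1 * xi for xi in x]
print(iv_estimate(y, x, w))
```

When the instrument is weak, the denominator `cov_wx` is close to zero, which is the source of the heavy tails of this estimator in small samples.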
[Figure 10 panel titles recovered: $J = 5{,}000$; $J = 10{,}000$]

Figure 10: 95% pointwise confidence regions for the marginal prior for $\beta$ for $J = 10$, $J = 1{,}000$, $J = 10{,}000$ and $J = 100{,}000$ points of random support in top-left, top-right, bottom-left and bottom-right, respectively. The confidence region is computed over 1,000 replications.


Our prior distribution, which is specified to be weakly informative, is

$$p(\beta,\theta) \propto \frac{1}{\sqrt{|J_\theta J_\theta' + I_2|}}\, \eta(\beta)\,\eta(\theta)\, 1_{\Theta_{\beta,\theta}}(\beta,\theta), \tag{39}$$

where $\eta(\beta) = \varphi(\beta_0; 5, 4)\,\varphi(\beta_1; 0, 0.2)$, and $\varphi(\cdot; \mu, \sigma^2)$ is the Gaussian density with mean $\mu$ and variance $\sigma^2$. The intercept is centered at 5 with variance 4, implying that the annual income for those with zero years of schooling is centered at \$7,717 (with 95% interval [\$153, \$388,965]). Moreover the prior of $\beta_1$ has zero mean (no effect of number of schooling years on income) with 95% interval $[-0.88, 0.88]$ (equivalent to income being multiplied by a factor between 0.41 and 2.41 for each additional year of schooling). The probabilities $\theta$ are taken as a mildly informative

J 

Dirichlet prior r]{9) oc where a = 10“® (we also tried a = J~^, with no substantial change 


1=1 

in the results). 

For 1,000 iterations, a random sample of size $J$ has been drawn with replacement from the 162,512 population. For each replication the resulting marginal prior distributions of $\beta_0$ and $\beta_1$ depend on the draws which generate the support and so vary over the 1,000 samples. Figure 10 shows the pointwise 95% confidence intervals of the marginal prior distributions over these 1,000 replications, for $J = 100$, 1,000, 5,000 and 10,000. It shows the information content of the prior is modest and only mildly depends upon the random support and $J$, with less variation across replications in the prior density as $J$ increases. Similar results have been obtained for other sample sizes $J$.

Figure 11: The sampling distribution of classic IV (denoted frequentist), Bayesian bootstrapping and Bayesian estimators of $\beta$ in the linear regression model with the instrumental variable employing sample sizes $J = 10$, $J = 1{,}000$, $J = 10{,}000$ and $J = 100{,}000$.


For each random sample, we compute the classic IV estimates of /3 and the Chamberlain and 


Imbens (2003) Bayesian bootstrapping estimates obtained by 10,000 draws. For the latter we 


report both the means and the medians as the estimators. These estimators are compared with 
the weakly informative Bayesian estimators (using the prior described earlier). 

The Bayesian estimates are obtained by the following resampling method. Initially a sample of size 10,000 is drawn from a Dirichlet distribution with parameter $(n_1 + \alpha - 1, \ldots, n_J + \alpha - 1)$, and the importance sampling weights are computed as $w^{(k)} \propto \eta(\beta^{(k)})$. Then a sample from the posterior can be obtained by resampling using the normalized weights. Estimators of the mean and the median of the posterior have been reported here. For $J = 10$, 100, 1,000, 5,000, 10,000, 40,000 and 100,000 the effective sample size divided by $J$ (Liu, 2001, p. 35) was 0.620, 0.576, 0.607, 0.719, 0.819, 0.978 and 0.997, respectively. This suggests this is a reasonable method for this problem.
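
The resampling step and the diagnostic can be sketched as follows. The effective sample size is computed here with the common approximation $(\sum_k w_k)^2 / \sum_k w_k^2$, which is an assumption about the exact variant used; the function names are illustrative.

```python
import random

def effective_sample_size(weights):
    """Effective sample size (sum w)^2 / sum w^2 for unnormalised weights."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)

def resample(draws, weights, size, rng):
    """Sampling importance resampling: draw with probability proportional to weight."""
    return rng.choices(draws, weights=weights, k=size)

rng = random.Random(0)
draws = [0.1, 0.2, 0.3, 0.4]
weights = [1.0, 2.0, 2.0, 1.0]
print(effective_sample_size(weights) / len(weights))
print(resample(draws, weights, 5, rng))
```

A ratio close to one indicates nearly even weights, while a ratio close to zero signals that a handful of draws dominate the estimate.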


In Figure 11 the sampling distributions of these five estimators have been plotted. The blue curves correspond to the classical IV estimator. They exhibit a very imprecise estimator and assign significant probabilities to economically irrelevant values of $\beta$ (this is a well known disappointing property of this estimator, e.g. Bound et al. (1995)). The mean of the Bayesian bootstrapping estimator of Chamberlain and Imbens (2003) has a very large variance too (the orange curves), but its median is more precise (the yellow curves). The Bayesian estimators (that is, the mean and the median of the posterior) are the most precise estimators.

$\beta_0$
                 Bias of mean               Bias of median             RMSE                       95% CR length
Sample size J    10       1,000    10,000   10       1,000    10,000   10       1,000    10,000   10       1,000    10,000
Classic IV       -0.104   0.214    -0.015   0.285    0.144    -0.009   14.27    35.41    0.216    42.83    44.52    0.832
BB E(θ|Z)        -0.174   0.909    -0.020   0.347    0.259    -0.015   50.01    27.32    0.221    42.21    45.36    0.851
BB med(θ|Z)      0.240    0.190    -0.015   0.287    0.227    -0.009   2.369    1.247    0.216    9.491    5.137    0.834
E(θ|Z)           0.323    0.269    -0.007   0.324    0.292    -0.003   0.979    0.640    0.211    3.667    2.447    0.815
med(θ|Z)         0.324    0.261    -0.002   0.326    0.290    0.003    1.034    0.669    0.207    3.837    2.572    0.803

$\beta_1$
                 Bias of mean               Bias of median             RMSE                       95% CR length
Sample size J    10       1,000    10,000   10       1,000    10,000   10       1,000    10,000   10       1,000    10,000
Classic IV       0.007    -0.016   0.001    -0.017   -0.011   0.001    1.100    2.783    0.017    3.398    3.496    0.065
BB E(θ|Z)        -0.001   -0.072   0.002    -0.019   -0.020   0.001    3.940    2.151    0.017    3.402    3.546    0.067
BB med(θ|Z)      -0.017   -0.015   0.001    -0.018   -0.017   0.001    0.186    0.098    0.017    0.725    0.404    0.066
E(θ|Z)           -0.023   -0.021   0.001    -0.020   -0.022   0.000    0.077    0.050    0.017    0.295    0.193    0.064
med(θ|Z)         -0.023   -0.020   0.000    -0.021   -0.022   0.000    0.081    0.052    0.016    0.307    0.200    0.063

Table 1: Results for the linear regression with an instrument using 1,000 replications sampling with replacement. The bias of the mean is the difference of the mean of the replications and the true value (using all 162,512 data points). The bias of the median is the median of the replications minus the true value. The 95% confidence region (CR) length is the length of 95% of the replications placing 2.5% of the mass in each tail. RMSE is the root mean square error over the replications. BB denotes the (non-informative) Bayesian bootstrap, med denotes median. The last two rows are posterior mean and posterior median of the Bayesian model with weakly informative prior.
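
The summary statistics defined in the caption of Table 1 can be computed from a list of replicated estimates as in the sketch below. The function name is hypothetical, and the quantile interpolation rule is whatever `statistics.quantiles` applies by default, which may differ slightly from the rule used for the table.

```python
import statistics

def summarize(estimates, truth):
    """Bias of mean, bias of median, RMSE and 95% region length over replications."""
    cuts = statistics.quantiles(estimates, n=40)  # cut points at 2.5%, 5%, ..., 97.5%
    rmse = (sum((e - truth) ** 2 for e in estimates) / len(estimates)) ** 0.5
    return {
        "bias_mean": statistics.mean(estimates) - truth,
        "bias_median": statistics.median(estimates) - truth,
        "rmse": rmse,
        "cr_length": cuts[-1] - cuts[0],  # 2.5% of the mass left in each tail
    }

print(summarize([1.0, 2.0, 3.0, 4.0, 5.0], truth=3.0))
```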

The bias (with its standard error) and the root mean square error (RMSE) of the estimators have been reported in Table 1. Although the Bayesian estimators are slightly biased, thanks to their small variances they have lower RMSEs. In Table 1 and Figure 12 we have also reported the length of the 95% confidence intervals of the sampling distribution of the estimators (over the 1,000 replications) of $\beta_0$ and $\beta_1$ for different sample sizes $J = 10$, 100, 1,000, 5,000, 10,000, 40,000 and 100,000. This shows that the Bayesian estimators are far more accurate than the classical IV estimator and Bayesian bootstrapping for most sample sizes. However, when $J$ reaches around 100,000 the older methods catch up with our techniques.

Why does our method do better? For weakly identified models even a very modestly informative prior, which downweights economically implausible values of the parameter space, has the trait of cutting off the tails of the posterior corresponding to these implausible values. Because of the ridge-like posterior induced by the weakly informative likelihood, the posterior contracts onto a manifold, rather than a single point. As such, having a prior which constrains the feasible support provides significant value.

Figure 12: The length of the 95% confidence intervals of the sampling distribution of the parameters $\beta_0$ and $\beta_1$ for different sample sizes $J = 10$, 100, 1,000, 5,000, 10,000, 40,000 and 100,000, and for classical IV estimator, Bayesian bootstrapping (mean and median) and Bayesian (mean and median). The bars denote our estimated 95% confidence intervals of the lengths.

In Appendix A.6 we have relaxed the assumption that the support of $Z$ is fully observed in our sample. It can be seen that the estimates would not change significantly as long as $\alpha$, the parameter of the Dirichlet distribution in the prior of $\theta^*$, is small. It can be shown that, when $\alpha \to 0^+$, the marginal posterior distributions of $\theta$ and of $\beta$ under both models coincide.


6.2 Causal Inference 


In this example we analyze a dataset originally collected and studied in Imbens et al. (2001). The dataset contains socioeconomic variables of 496 individuals who had won monetary prizes in the Massachusetts lottery. Following Imbens and Rubin (2015), we call the individuals who won large sums of money "the winners" (237 observations), and the ones who won only small amounts "the losers" (259 observations). The goal is to study the effect of unearned income on the economic behavior of the subjects, more specifically, on their average labor income over the first six years following the year in which they had won the lottery. For each individual the treatment indicator, $W_i$, is equal to one for the winners and zero for the losers. The uncontroversial assumption behind this study is the random treatment assignment; however, one may argue that the sample is not representative of the population. For instance, it is well documented in the literature that lottery players are slightly more likely to be male and middle-aged, with lower income and less education (see Clotfelter and Cook (1989), Farrel and Walker (1999) and Ariyabuddhiphongs (2011), among others).

The dataset includes the year in which the winning lottery ticket is purchased (YW), the number of tickets purchased in a typical week (TB), the individual's age (Age), gender (G) and years of schooling (YS), an indicator showing whether she has been working during the year the winning ticket is purchased (WT), and the annual social security earnings from 6 years prior to the year in which the winning ticket is purchased (EYB1 to EYB6) to 6 years after that (EYA1 to EYA6), all converted to 1986 dollars. The authors argue, perhaps optimistically, that the social security income is potentially the most reliable measure of income in the long run, although it is capped at the maximum taxable earnings (\$42,000 in 1986).

In order to improve the overlap of the background variables, following the recommendation of Imbens and Rubin (2015), initially we model the propensity scores using a logistic regression model and estimate the model's parameters using the Bayesian bootstrapping of Chamberlain and Imbens (2003). The covariates of the model are a constant, the linear terms TB, YS, WT, EYB1, Age, YW, the indicator for the positiveness of the earnings 5 years before winning the lottery (SEYB5), G, and the quadratic terms YW x YW, EYB1 x G, TB x TB, TB x WT, YS x YS, YS x EYB1, TB x YS, EYB1 x Age, Age x Age, and YW x G. We discard the observations with too small (< 0.0891) or too large (> 0.9109) estimates of propensity scores. This results in a sample of size $N = 295$ (142 winners and 153 losers). In the proposed model the propensity score is regressed on 13 covariates using a logistic regression. The vector of covariates is denoted by $X_i$, and includes a constant, the linear terms TB, YS, WT, EYB1, Age, SEYB5, YW, EYB5, and the quadratic terms YW x YW, TB x YW, TB x TB, and WT x YW. For details on the variable selection see Imbens and Rubin (2015). The outcome, $Y_i$, is the individual's income averaged over the first 6 years after purchasing the winning lottery ticket. Therefore the parameters of the logistic regression model, $\gamma$, and the ATE, $\tau$, satisfy the following moment conditions,


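$$E[g(Z_i, \beta)] = 0, \tag{40}$$

in which $Z_i = (X_i, Y_i, W_i)$, $\beta = (\gamma, \tau)$, and

$$g(Z_i, \beta) = \begin{pmatrix} X_i (W_i - v_i) \\ \dfrac{(W_i - v_i) Y_i}{v_i (1 - v_i)} - \tau \end{pmatrix}, \tag{41}$$

where $v_i = \exp(X_i'\gamma) / \{1 + \exp(X_i'\gamma)\}$ is the propensity score. If we assume the $Z_i$ are i.i.d. draws from a discrete distribution supported on $\{s_1, \ldots, s_J\}$, with $P(Z_i = s_j) = \theta_j$, the parameters $(\beta, \theta)$ will satisfy the following system of equations,

$$\begin{pmatrix} \sum_{j=1}^{J} \theta_j X_j (W_j - v_j) \\ \sum_{j=1}^{J} \theta_j \dfrac{(W_j - v_j) Y_j}{v_j (1 - v_j)} - \tau \end{pmatrix} = 0. \tag{42}$$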


We let the prior of $(\beta, \theta)$ be

$$p(\beta,\theta) \propto \frac{1}{\sqrt{|J_\theta J_\theta' + I_p|}}\, \eta(\gamma)\,\eta(\tau)\,\eta(\theta)\, 1_{\Theta_{\beta,\theta}}(\beta,\theta), \tag{43}$$

in which the initial prior of the regression coefficients, $\eta(\gamma)$, is a normal distribution centered at their estimates obtained from the Bayesian bootstrap of Chamberlain and Imbens (2003) and its covariance matrix is equal to the covariance matrix of estimates scaled by a factor of 100, and the initial prior of ATE is a zero mean normal distribution with variance equal to 100. Moreover we
use a symmetric Dirichlet distribution with parameter a = 10“® as the initial prior on 9. 


By reweighting draws from the posterior distribution of the Bayesian bootstrap of Chamberlain and Imbens (2003), we obtain 10,000 independent draws from the posterior of our model. An estimate of the posterior distribution of the ATE is depicted in Figure 13. A posteriori the expected value of ATE is $-\$5{,}346$ (with 95% credible interval of $[-\$8{,}069, -\$2{,}720]$). This indicates that the average income of the winners of the lottery, in the years after winning the prize, tends to decrease slightly. Our estimate of ATE is only slightly different from the frequentist estimate.


7 Conclusions 


In this paper we have provided a coherent Bayesian calculus for rational nonparametric moment 
based estimators, allowing users to specify scientifically meaningful priors. At the core of our 
analysis is a prior density placed on the Hausdorff measure whose support is generated by the 
scientific parameters of interest and the nonparametric probabilities. We show how to transform 
this prior into a posterior density. 

Much moment based analysis favoured in the literature delivers weakly identified parameters. 
The use of very modest priors can dramatically improve estimation by downweighting vast regions 
of economically implausible parameter values. Such weak priors play little role when the data is 
informative but provide a safety net when this is not the case.

Figure 13: The posterior distribution of the average treatment effect (ATE) on subsequent annual earnings of a substantial lottery win for the lottery data set.

To harness these gains, at the center of our paper are the marginal method and the joint method. 
The first is based on finding the density of the probabilities with respect to a Lebesgue measure. 
This allows for the use of conventional simulation methods such as MCMC, importance sampling 
and Hamiltonian Monte Carlo. It is convenient to use where the moment conditions can be solved 
analytically or numerically very fast. 

Our joint method is somewhat harder to code but has the virtue of never having to solve the 
moment equations. This has some speed advantages but more fundamentally allows the rational 
analysis of moment condition models with many solutions. As a side product our method provides 
a novel way of generically simulating on a wide class of manifolds, which may be useful in other 
areas of science.

A Appendices

A.1 Proof of proposition

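Since corresponding to every $\theta \in \Theta_\theta$ there is a unique $\beta$, there exists a one-to-one mapping between $\Theta_{\beta,\theta}$ and $\Theta_\theta$: $(\beta, \theta) = (\beta(\theta), \theta) = F(\theta)$. Now let $A$ be a measurable set on $\Theta_{\beta,\theta}$, and assume $S_\theta(A)$ is its projection on $\Theta_\theta$. Therefore

$$P(S_\theta(A)) = P(A) = \int_A p(\beta, \theta)\, dA = \int_{S_\theta(A)} \|v_1 \wedge \cdots \wedge v_{J-1}\|\, p(\beta, \theta)\, dS,$$

where $v_j = \partial F / \partial \theta_j$ (for $1 \le j \le J-1$). Therefore $\|v_1 \wedge \cdots \wedge v_{J-1}\|\, p(\beta, \theta)$ is the density of $\theta$ with respect to Lebesgue measure. Moreover,

$$\|v_1 \wedge \cdots \wedge v_{J-1}\| = [\mathrm{Gram}(v_1, \ldots, v_{J-1})]^{1/2} = |J_\theta' J_\theta + I_{J-1}|^{1/2} = |J_\theta J_\theta' + I_p|^{1/2},$$

where $\mathrm{Gram}(\cdot)$ is the Gramian determinant and $J_\theta = \partial\beta / \partial\theta'$.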
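The last equality in the proof is a determinant identity that is easy to check numerically. The sketch below is illustrative only: it draws an arbitrary $p \times (J-1)$ matrix to stand in for $J_\theta$ (an assumption, not a Jacobian from any model in the paper) and compares the Gram determinant of the columns of $\partial F/\partial\theta'$ with the $p \times p$ form.

```python
import numpy as np

def area_factor(J_theta):
    """sqrt(det(J J' + I_p)) for a p x (J-1) Jacobian J = d beta / d theta'."""
    p = J_theta.shape[0]
    _, logdet = np.linalg.slogdet(J_theta @ J_theta.T + np.eye(p))
    return np.exp(0.5 * logdet)

def area_factor_gram(J_theta):
    """Same factor from the Gram matrix of v_j = dF/dtheta_j, F = (beta(theta), theta)."""
    n = J_theta.shape[1]
    DF = np.vstack([J_theta, np.eye(n)])   # columns are v_1, ..., v_{J-1}
    _, logdet = np.linalg.slogdet(DF.T @ DF)
    return np.exp(0.5 * logdet)

rng = np.random.default_rng(0)
J_theta = rng.normal(size=(2, 6))          # stand-in Jacobian with p = 2, J - 1 = 6
print(area_factor(J_theta), area_factor_gram(J_theta))
```

The $p \times p$ form is the cheaper one to evaluate whenever $p$ is much smaller than $J$.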


A.2 Proof of proposition 


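Let $p(\beta)$ be the density of $\beta$. Then, given $\beta$, the vector of probabilities $\theta$ lives on a $(J-1-p)$-dimensional hyperplane in $\mathbb{R}^{J-1}$ defined by $H\theta + g_J = 0$. This system of equations can be solved for $p$ elements of the variables, $\theta_{J-p:J-1} = -H_2^{-1}(H_1 \theta_{1:J-p-1} + g_J)$, where $H_1 = [h_1 \ldots h_{J-p-1}]$ and $H_2 = [h_{J-p} \ldots h_{J-1}]$. Therefore $\partial\theta_{J-p:J-1} / \partial\theta_{1:J-p-1}' = -H_2^{-1} H_1$ and so

$$p(\theta_{1:J-p-1} \mid \beta) = |H_2^{-1} H_1 H_1' (H_2')^{-1} + I_p|^{1/2}\, p(\theta \mid \beta),$$

$$p(\theta_{1:J-p-1}, \beta) = |H_2^{-1} H_1 H_1' (H_2')^{-1} + I_p|^{1/2}\, p(\beta)\, p(\theta \mid \beta).$$

Therefore the density of $\theta$ is

$$p(\theta) = \left| \frac{\partial(\theta_{1:J-p-1}, \beta)}{\partial\theta'} \right| p(\theta_{1:J-p-1}, \beta).$$

Therefore, using the relation between $p(\theta)$ and $p(\beta, \theta)$ from A.1:

$$p(\beta, \theta) = \frac{\left| \dfrac{\partial(\theta_{1:J-p-1}, \beta)}{\partial\theta'} \right| \, |H_2^{-1} H_1 H_1' (H_2')^{-1} + I_p|^{1/2}}{\left| \dfrac{\partial\beta}{\partial\theta'} \dfrac{\partial\beta'}{\partial\theta} + I_p \right|^{1/2}}\; p(\beta)\, p(\theta \mid \beta).$$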


A.3 Joint method proposal 

In order to generate a proposal value for $\theta^*$, we can first draw $\pi^*$ from $N(\theta, \Sigma_\theta)$, and let $\theta^*$ be the closest point to $\pi^*$ in the hyperplane $\mathcal{P}^* = \{\lambda \in \mathbb{R}^{J-1} \mid H^*\lambda + g_J^* = 0\}$, where we measure the distance between $\pi^*$ and $\theta^*$ with the squared Euclidean norm:

$$\theta^* = \arg\min_{\theta}\; \tfrac{1}{2}\|\pi^* - \theta\|_2^2 + \tfrac{1}{2}(\iota'\pi^* - \iota'\theta)^2.$$

The quadratic penalty is certainly inelegant (e.g. compared to the log-likelihood of the multinomial model, but see, for example, Owen (1991) and Antoine et al. (2007) who use it for their Euclidean empirical likelihood) as the resulting $\theta^*$ can have negative elements or may result in $\theta_J^* = 1 - \iota'\theta^* < 0$. However, by using a quadratic penalty, $\theta^*$ becomes the solution to a quadratic optimization problem subject to $p$ equality constraints, and so has an analytic solution $\theta^* = a^* + B^*\pi^*$.

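The Lagrangian of the optimization is

$$\mathcal{L}(\theta, \lambda) = \tfrac{1}{2}\|\pi^* - \theta\|_2^2 + \tfrac{1}{2}(\iota'\pi^* - \iota'\theta)^2 + \lambda'(H^*\theta + g_J^*),$$

and the first order conditions are:

$$\frac{\partial\mathcal{L}}{\partial\theta} = (I + \iota\iota')(\theta^* - \pi^*) + H^{*\prime}\lambda = 0, \qquad \frac{\partial\mathcal{L}}{\partial\lambda} = H^*\theta^* + g_J^* = 0.$$

Solving them for $\theta^*$ and $\lambda$ results in

$$\theta^* = \pi^* - (I + \iota\iota')^{-1} H^{*\prime} \left[ H^* (I + \iota\iota')^{-1} H^{*\prime} \right]^{-1} (H^*\pi^* + g_J^*),$$

$$\lambda = \left[ H^* (I + \iota\iota')^{-1} H^{*\prime} \right]^{-1} (H^*\pi^* + g_J^*).$$

Therefore $\theta^*$ is an affine transformation of $\pi^*$: $\theta^* = a^* + B^*\pi^*$, where

$$a^* = -(I + \iota\iota')^{-1} H^{*\prime} \left[ H^* (I + \iota\iota')^{-1} H^{*\prime} \right]^{-1} g_J^*,$$

$$B^* = I_{J-1} - (I + \iota\iota')^{-1} H^{*\prime} \left[ H^* (I + \iota\iota')^{-1} H^{*\prime} \right]^{-1} H^*.$$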
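The affine solution above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: `affine_projection` is a hypothetical helper, `H` stands for $H^*$ (a $p \times (J-1)$ matrix of full row rank) and `g` for $g_J^*$.

```python
import numpy as np

def affine_projection(H, g):
    """Return (a, B) such that theta* = a + B @ pi solves
    min 0.5*||pi - theta||^2 + 0.5*(1'pi - 1'theta)^2  subject to  H @ theta + g = 0."""
    n = H.shape[1]
    M = np.eye(n) + np.ones((n, n))            # I + iota iota'
    MinvHt = np.linalg.solve(M, H.T)           # M^{-1} H'
    K = MinvHt @ np.linalg.inv(H @ MinvHt)     # M^{-1} H' (H M^{-1} H')^{-1}
    a = -K @ g
    B = np.eye(n) - K @ H
    return a, B
```

Because $H^* B^* = 0$ and $H^* a^* = -g_J^*$, every proposed $\theta^*$ satisfies the moment constraint exactly, whatever the unconstrained draw $\pi^*$ is.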


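This transformation from $\pi^*$ to $\theta^*$ is a many-to-one affine transformation. Consequently, $\theta^* \mid \beta^*$ is a singular normal distribution with mean $a^* + B^*\theta^{(t)}$ and variance matrix $B^*\Sigma_\theta B^{*\prime}$.

A singular normal distribution with mean $\mu$ and (singular) variance matrix $\Sigma$ has a density on the range of the covariance matrix (e.g. Khatri (1968)), given by

$$f(x) = (2\pi)^{-\mathrm{rank}(\Sigma)/2}\, |\Sigma|_{\mathrm{rank}(\Sigma)}^{-1/2} \exp\left\{ -\tfrac{1}{2}(x - \mu)'\Sigma^{+}(x - \mu) \right\},$$

where $|\Sigma|_{\mathrm{rank}(\Sigma)}$ is the product of the non-zero eigenvalues of $\Sigma$ and $\Sigma^{+}$ is its Moore-Penrose inverse.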
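The singular normal density can be evaluated from a single eigendecomposition, which yields the rank, the product of non-zero eigenvalues and the Moore-Penrose quadratic form together. The sketch below is a minimal illustration; the function name and the relative tolerance `tol` used to decide which eigenvalues count as non-zero are assumptions.

```python
import numpy as np

def singular_normal_logpdf(x, mu, Sigma, tol=1e-10):
    """Log density of N(mu, Sigma) on the range of a possibly singular Sigma."""
    eigval, eigvec = np.linalg.eigh(Sigma)
    keep = eigval > tol * eigval.max()          # treat tiny eigenvalues as zero
    rank = int(keep.sum())
    log_pdet = np.sum(np.log(eigval[keep]))     # log of product of non-zero eigenvalues
    z = eigvec[:, keep].T @ (x - mu)            # coordinates of x - mu on the range
    quad = np.sum(z ** 2 / eigval[keep])        # (x - mu)' Sigma^+ (x - mu)
    return -0.5 * (rank * np.log(2.0 * np.pi) + log_pdet + quad)
```

The value is only meaningful when `x - mu` lies in the range of `Sigma`, which holds by construction for proposals of the form $a^* + B^*\pi^*$.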

In our algorithm, $\Sigma_\theta$ and the parameters of the proposal density of $\beta$ are the tuning parameters. We may either adapt them in the course of simulation, or they can be set to some fixed values obtained from an estimate of the posterior's distribution. Here we document how we have carried this out for our simulation and empirical work. A simple to calculate candidate for the covariance of $\beta$'s proposal is $(\Sigma_{0\beta}^{-1} + \Sigma_{BB\beta}^{-1})^{-1}$, where $\Sigma_{0\beta}$ is the prior's covariance and $\Sigma_{BB\beta}$ is the covariance of the estimates of $\beta$ obtained by Bayesian bootstrapping of Chamberlain and Imbens (2003). (As an alternative we may use the asymptotic covariance of the least squares or GMM estimators.)

Moreover a suitable candidate for $\Sigma_\theta$ is $\mathrm{diag}(\hat\theta_1^2, \ldots, \hat\theta_{J-1}^2)$, where:

$$\hat\theta = (\hat\theta_1, \ldots, \hat\theta_{J-1}) = \arg\max_{\theta} \sum_{j} n_j \ln\theta_j \quad \text{subject to} \quad \hat H\theta + \hat g_J = 0, \tag{44}$$


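in which $\hat H = (\hat g_1, \ldots, \hat g_{J-1}) - \hat g_J \iota'$, $\hat g_j = g(\hat\beta, s_j)$ and $\hat\beta = (\Sigma_{0\beta}^{-1} + \Sigma_{BB\beta}^{-1})^{-1} (\Sigma_{0\beta}^{-1}\mu_0 + \Sigma_{BB\beta}^{-1}\hat\beta_{BB})$, with $\mu_0$ the prior mean and $\hat\beta_{BB}$ the Bayesian bootstrap estimate of $\beta$.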


A.4 Large support 

An apparent drawback of the joint method is that in each evaluation of the proposal's density, the Moore-Penrose inverse of the $(J-1) \times (J-1)$ matrix $B^*\Sigma_\theta B^{*\prime}$ should be computed. In general this costs $O(J^3)$ computational operations. This type of challenge is very common in Bayesian analysis and a standard approach to this problem is to make proposals to update a block of $K \ll J$ elements of $\theta$, with cost $O(K^3)$.

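Figure 14: Inference in linear regression model with J = 500 and an informative prior.

Let the $K \times 1$ vector $u$ be a randomly (without replacement) selected subset of the indices $\{1, \ldots, J-1\}$ and the $(J-K-1) \times 1$ vector $v$ be its complement. Moreover let $\tilde\theta = (\theta_{u_1}, \ldots, \theta_{u_K})'$ and $\bar\theta = (\theta_{v_1}, \ldots, \theta_{v_{J-K-1}})'$. The proposal's vector of probabilities, $\theta^*$, is equal to $\theta$ except for the $K$ elements with indices $u$, $\tilde\theta^* = (\theta_{u_1}^*, \ldots, \theta_{u_K}^*)'$, that is obtained by solving:

$$\tilde\theta^* = \arg\min_{\tilde\theta}\; \tfrac{1}{2}\|\tilde\theta - \tilde\pi^*\|_2^2 + \tfrac{1}{2}(\iota'\tilde\theta - \iota'\tilde\pi^*)^2 \quad \text{subject to} \quad \tilde H^*\tilde\theta + \bar H^*\bar\theta^{(t)} + g_J^* = 0, \tag{45}$$

where $\tilde H^* = (g_{u_1}^*, \ldots, g_{u_K}^*) - g_J^*\iota'$, $\bar H^* = (g_{v_1}^*, \ldots, g_{v_{J-K-1}}^*) - g_J^*\iota'$, and $\tilde\pi^*$ is a random draw from $N(\tilde\theta, \Sigma_{\tilde\theta})$. Again this is a quadratic optimization problem subject to a set of equality constraints with the following solution: $\tilde\theta^* = \tilde a^* + \tilde B^*\tilde\pi^*$, where

$$\tilde a^* = -(I + \iota\iota')^{-1}\tilde H^{*\prime} \left[ \tilde H^* (I + \iota\iota')^{-1} \tilde H^{*\prime} \right]^{-1} (\bar H^*\bar\theta^{(t)} + g_J^*),$$

$$\tilde B^* = I_K - (I + \iota\iota')^{-1}\tilde H^{*\prime} \left[ \tilde H^* (I + \iota\iota')^{-1} \tilde H^{*\prime} \right]^{-1} \tilde H^*.$$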
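The block update can be sketched as follows. This is a hedged illustration, not the authors' code: `block_proposal` and its arguments are hypothetical names, `H_full` stands for the full $p \times (J-1)$ matrix evaluated at the proposed $\beta^*$, `Sigma_diag` holds the diagonal of a diagonal $\Sigma_\theta$, and the block size `K` is assumed to be at least $p$ so that the $p \times p$ system is invertible.

```python
import numpy as np

def block_proposal(theta, H_full, g, K, Sigma_diag, rng):
    """Propose new values for K randomly chosen elements of theta, holding the rest fixed."""
    n = theta.size
    u = rng.choice(n, size=K, replace=False)       # indices to update
    v = np.setdiff1d(np.arange(n), u)              # complement, held fixed
    H_u, H_v = H_full[:, u], H_full[:, v]
    c = H_v @ theta[v] + g                         # fixed part of the constraint
    M = np.eye(K) + np.ones((K, K))                # I_K + iota iota'
    MinvHt = np.linalg.solve(M, H_u.T)             # M^{-1} H_u'
    Kmat = MinvHt @ np.linalg.inv(H_u @ MinvHt)    # only a p x p matrix is inverted here
    pi = theta[u] + np.sqrt(Sigma_diag[u]) * rng.standard_normal(K)
    theta_new = theta.copy()
    theta_new[u] = pi - Kmat @ (H_u @ pi + c)      # equals a + B @ pi for the block
    return theta_new, u
```

The linear algebra now involves $K \times K$ and $p \times p$ systems only, which is the source of the $O(K^3)$ cost quoted above.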


A.5 Linear regression with an informative prior 

Here we report the results for the linear regression model with sample size $J = 500$, and an informative prior for $\beta$. We place a normal prior on $\beta$ with the mean equal to $\hat\beta_{MLE} + (5, -5)'$ and the variance equal to the asymptotic variance of $\hat\beta_{MLE}$. Therefore the prior is as informative as the data, however centered at a significantly different point.

Figure 14's top-left panel shows a scatter plot of the sample. Each circle represents a data point and its radius is proportional to $E(\theta_j \mid Z)$. In the top-right panel the ACFs of the chains of $\beta$ and 50 elements of $\theta$ are presented (the red dashed lines and the blue dotted lines correspond to $\beta$ and $\theta$, respectively). These show that the Markov chain is mixing sufficiently well. In the bottom-left panel the contour plots of the prior distribution (bottom), the posterior distribution of $\beta$ using Bayesian bootstrapping, and the posterior distribution of $\beta$ considering the informative prior (middle) are depicted. In the bottom-right panel the histogram of the samples from the posterior of $\beta$ can be seen.

A.6 Instrumental variables with partially observed support

Now we assume the support of $S = (A, V, W)$ has a further $J$ missing elements (not observed in the data), therefore $J^* = 2J$. The density of our prior for the missing elements of the support is

$$f_S(s) = f_{S^{(1)}}(s^{(1)})\, f_{S^{(2)}}(s^{(2)})\, f_{S^{(3)}}(s^{(3)}), \tag{46}$$

in which $f_{S^{(1)}}$ and $f_{S^{(2)}}$ are the densities of uniform distributions on $\{0, 1, \ldots, 20\}$ and $\{0, 1\}$, respectively, and $f_{S^{(3)}}$ is a normal density with mean 6 and standard deviation 3.

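Moreover we assume,

$$p(\beta, \theta^* \mid S) \propto \frac{\eta(\beta)\, \eta(\theta^*)\, 1_{\Theta_{\beta,\theta^*,S}}(\beta, \theta^*, S)}{|J_{\theta^*} J_{\theta^*}' + I_2|^{1/2}}, \tag{47}$$

where $\eta(\beta) = \varphi(\beta_0; 5, 4)\, \varphi(\beta_1; 0, 0.2)$ (similar to the previous case), and $\eta(\theta^*)$ is the density of a symmetric Dirichlet distribution with $\alpha$ = 10“®. Hence the posterior distribution of $(\theta^*, S)$ is:

$$p(\theta^*, S \mid Z) \propto \eta(\beta) \left( \prod_{j=J+1}^{J^*} f_S(s_j) \right) \left( \prod_{j=1}^{J^*} (\theta_j^*)^{n_j + \alpha - 1} \right), \tag{48}$$

where $n_j$ is the number of observations equal to $s_j$ (zero for the unobserved support points). To sample from this distribution we can reweight random draws from the following proposal,

$$g(\theta^*, S \mid Z) \propto \left( \prod_{j=J+1}^{J^*} f_S(s_j) \right) \left( \prod_{j=1}^{J^*} (\theta_j^*)^{n_j + \alpha - 1} \right), \tag{49}$$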


with the weights proportional to $\eta(\beta)$. Now we set $J = 10$, and 1000 times we draw a random sample from our dataset. Then we compare the posterior distribution of the parameters under two assumptions. In the first model we assume the support of $S$ is fully observed in the data (similar to the previous section of this example), while in the second model we assume the support has $J = 10$ more elements that are not observed in our sample. Since the prior of the probabilities and $\beta$ is barely informative, the posterior distributions of $\beta$ are almost indistinguishable under these two assumptions.


A.7 Not the just identified case

A.7.1 Abstract expression of the problem 

Collect all the parameters in the model and constraints as

$$\psi = (\theta_1, \ldots, \theta_{J-1}, \beta_1, \ldots, \beta_p)', \qquad g(\psi) = 0_r.$$

Then the resulting constrained support is $\psi \in \Theta_\psi$. Write $\lambda = \psi_{[\mathcal{I}]}$, $\varphi = \psi_{[\mathcal{I}^c]}$, where $\mathcal{I}$ selects distinct indexes of $\psi$ and $\mathcal{I}^c$ is the complement, so $\mathcal{I} \cup \mathcal{I}^c = \{1, 2, \ldots, J + p - 1\}$. Throughout we take $\dim(\varphi) = r$ and consequently $\dim(\lambda) = J - m$, where $m = r - p + 1$.

Given the freedom to build $\mathcal{I}$ we make the following assumption.

Assumption A. Under $g(\psi) = 0$ knowledge of $\lambda$ reveals $\varphi$, so there exists a unique $\varphi = t(\lambda)$.

A.7.2 Marginal method 

Under Assumption A, the area formula implies that $p(\lambda) = p(\psi)\, |I_{J-m} + J_\varphi' J_\varphi|^{1/2}$, $J_\varphi = \partial\varphi / \partial\lambda'$, where $p(\psi)$ is a density with respect to the $(J - 1 + p - r)$-dimensional Hausdorff measure on $\Theta_\psi$, while $p(\lambda)$ is a density with respect to the $(J - m)$-dimensional Lebesgue measure.
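A one-dimensional toy case may help fix ideas about the area formula factor. The sketch below is purely illustrative and is not taken from the paper: it assumes a hypothetical manifold $\varphi = t(\lambda) = \lambda^2$ on $0 \le \lambda \le 1$, for which the Hausdorff measure is arc length, and checks that integrating $|I + J_\varphi' J_\varphi|^{1/2}$ over $\lambda$ recovers the length of the curve.

```python
import numpy as np

def lebesgue_factor(J_phi):
    """|I + J'J|^{1/2} for J = d phi / d lambda' with shape (r, dim(lambda))."""
    d = J_phi.shape[1]
    return np.sqrt(np.linalg.det(np.eye(d) + J_phi.T @ J_phi))

# Toy manifold (an assumption for illustration): phi = lambda^2, 0 <= lambda <= 1.
lam = np.linspace(0.0, 1.0, 20001)
factor = np.array([lebesgue_factor(np.array([[2.0 * l]])) for l in lam])

# Trapezoid rule for the integral of the factor over lambda.
area_integral = np.sum(0.5 * (factor[1:] + factor[:-1]) * np.diff(lam))

# Direct approximation of the arc length by a fine polyline.
pts = np.column_stack([lam, lam ** 2])
polyline = np.sum(np.linalg.norm(np.diff(pts, axis=0), axis=1))
print(area_integral, polyline)
```

A density that is uniform with respect to arc length on this curve therefore has Lebesgue density proportional to $\sqrt{1 + 4\lambda^2}$ in $\lambda$, which is the scalar version of the formula above.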

A.7.3 Underidentification 

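Definition 1 If $r < p$ (so $m \le 0$) then the system is called underidentified.

We split $\beta = (\beta_1, \ldots, \beta_p)'$ as $\beta_{[1]} = \beta_{\mathcal{I}}$, $\beta_{[2]} = \beta_{\mathcal{I}^c}$, where $\mathcal{I} \cup \mathcal{I}^c = \{1, 2, \ldots, p\}$, $\dim(\mathcal{I}) = p - r$ and $\dim(\mathcal{I}^c) = r$, and build $\lambda = (\theta_1, \ldots, \theta_{J-1}, \beta_{[1]}')'$, $\varphi = \beta_{[2]}$. Hence $\lambda$ augments $\theta$ with $p - r$ elements from $\beta$. Assumption A holds if $\mathcal{I}$ can be found such that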

Example 6 Consider instrumental variables problem g{s,j3) = — fi', dim(s(^^) = p, 

dim(s(^)) = r. If p > r then split fd = /3j2]^ ; where dim(/3[]^]) = r — p and dim(/3[2]) = r. 

Write Sj = s'. [ 2 ]^ , then 





Knowledge of puts us back to the just identified, so Assumption A holds under weak assumptions 
and so p{9,pd^2]) be computed using the area formula. 

A.7.4 Overidentification 

Definition 2 If r > p so m > 1 (e.g. r = 2,p = 1, m = 2) then the system is called overidentified. 


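We split $\theta = (\theta_1, \ldots, \theta_{J-1})'$ as $\theta_{[1]} = \theta_{\mathcal{I}}$, $\theta_{[2]} = \theta_{\mathcal{I}^c}$, where $\mathcal{I} \cup \mathcal{I}^c = \{1, 2, \ldots, J-1\}$, $\dim(\mathcal{I}) = J - m$ and $\dim(\mathcal{I}^c) = m - 1$, and build $\lambda = \theta_{[1]}$, $\varphi = (\theta_{[2]}', \beta')'$. Hence $\lambda$ is a subset of $\theta$ with $J - m$ elements, while $\varphi$ contains all the other probabilities and the entire $\beta$. Then Assumption A holds if we can find an $\mathcal{I}$ such that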
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Example 7 Again consider g{s,j3) = — /3's^'^^, dim(s*^^)) = p, dim(s*^^^) = r. If p < 

r then split 6 = , 0 j 2 ] ^ , where dmi{ 6 '^ 2 p t^') = so there are r moment conditions and r 

unknowns. Given 9]^^, we can then solve for the extended set of parameters where 


dim(0[i]) j 

i=l i=dim(0[y)+l 



This is typically exactly identified, but non-linear due to the $\theta_j\beta$ terms for $j = \dim(\theta_{[1]}) + 1, \ldots, J$.
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